Abstract

We investigate the problem of sequentially predicting the binary labels on the nodes of 
an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, 
the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by 
the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown 
adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple
randomized algorithm achieving in expectation the optimal mistake bound on any polynomi- 
ally connected weighted graph. Our algorithm draws a random spanning tree of the original 
graph and then predicts the nodes of this tree in constant expected amortized time and linear 
space. Experiments on real-world datasets show that our method compares well to both global 
(Perceptron) and local (label propagation) methods, while being generally faster in practice. 

1 Introduction 

A widespread approach to the solution of classification problems is representing datasets through 
a weighted graph where nodes are the data items and edge weights quantify the similarity between
pairs of data items. This technique for coding input data has been applied to several domains,
including Web spam detection [19], classification of genomic data [27], face recognition [10], and
text categorization [13]. In many applications, edge weights are computed through a complex
data-modelling process and typically convey information that is relevant to the task of classifying the
nodes. 

In the sequential version of this problem, nodes are presented in an arbitrary (possibly adversar- 
ial) order, and the learner must predict the binary label of each node before observing its true value. 
Since real-world applications typically involve large datasets (i.e., large graphs), online learning 
methods play an important role because of their good scaling properties. An interesting special 
case of the online problem is the so-called transductive setting, where the entire graph structure 
(including edge weights) is known in advance. The transductive setting is interesting in that the 
learner has the chance of reconfiguring the graph before learning starts, so as to make the problem 
look easier. This data preprocessing can be viewed as a kind of regularization in the context of 
graph prediction. 

When the graph is unweighted (i.e., when all edges have the same common weight), it was 
found in previous works [11], [16], [14], [15], [H that a key parameter to control the number of online
prediction mistakes is the size of the cut induced by the unknown adversarial labeling of the nodes, 
i.e., the number of edges in the graph whose endpoints are assigned disagreeing labels. However, 
while the number of mistakes is obviously bounded by the number of nodes, the cutsize scales 
with the number of edges. This naturally led to the idea of solving the prediction problem on a 
spanning tree of the graph [7] dH [19], whose number of edges is exactly equal to the number of
nodes minus one. Now, since the cutsize of the spanning tree is smaller than that of the original 
graph, the number of mistakes in predicting the nodes is more tightly controlled. In light of the 
previous discussion, we can also view the spanning tree as a "maximally regularized" version of 
the original graph. 

Since a graph has up to exponentially many spanning trees, which one should be used to max- 
imize the predictive performance? This question can be answered by recalling the adversarial 
nature of the online setting, where the presentation of nodes and the assignment of labels to them 
are both arbitrary. This suggests picking a tree at random among all spanning trees of the graph
so as to prevent the adversary from concentrating the cutsize on the chosen tree [7]. Kirchhoff's
equivalence between the effective resistance of an edge and its probability of being included in a
random spanning tree allows us to express the expected cutsize of a random spanning tree in a simple
form, namely as the sum of resistances over all edges in the cut of G induced by the adversarial
label assignment.

Although the results of [7] yield a mistake bound for arbitrary unweighted graphs in terms of
the cutsize of a random spanning tree, no general lower bounds are known for online unweighted
graph prediction. The scenario gets even more uncertain in the case of weighted graphs, where the
only previous papers we are aware of [16, 14, 15] essentially contain only upper bounds. In this
paper we fill this gap, and show that the expected cutsize of a random spanning tree of the graph
delivers a convenient parametrization^1 that captures the hardness of the graph learning problem
in the general weighted case. Given any weighted graph, we prove that any online prediction
algorithm must err on a number of nodes which is at least as big as the expected cutsize of the
graph's random spanning tree (which is defined in terms of the graph weights). Moreover, we
exhibit a simple randomized algorithm achieving in expectation the optimal mistake bound to
within logarithmic factors. This bound applies to any sufficiently connected weighted graph whose
weighted cutsize is not an overwhelming fraction of the total weight.

^1 Different parametrizations of the node prediction problem exist that lead to bounds which are incomparable to
ours (see Section 2).

Following the ideas of [7], our algorithm first extracts a random spanning tree of the original
graph. Then, it predicts all nodes of this tree using a generalization of the method proposed by
[18]. Our tree prediction procedure is extremely efficient: it only requires constant amortized
time per prediction and space linear in the number of nodes. Again, we would like to stress that 
computational efficiency is a central issue in practical applications where the involved datasets can 
be very large. In such contexts, learning algorithms whose computation time scales quadratically, 
or slower, in the number of data points should be considered impractical. 

As in [18], our algorithm first linearizes the tree, and then operates on the resulting line graph
via a nearest neighbor rule. We show that, besides running time, this linearization step brings 
further benefits to the overall prediction process. In particular, similar to [16, Theorem 4.2], the 
algorithm turns out to be resilient to perturbations of the labeling, a clearly desirable feature from 
a practical standpoint. 

In order to provide convincing empirical evidence, we also present an experimental evaluation 
of our method compared to other algorithms recently proposed in the literature on graph prediction. 
In particular, we test our algorithm against the Perceptron algorithm with Laplacian kernel by 
[16, 19], and against a version of the label propagation algorithm by JSQ. These two baselines can
be viewed as representatives of global (Perceptron) and local (label propagation) learning methods on
graphs. The experiments have been carried out on five medium-sized real-world datasets. The two 
tree-based algorithms (ours and the Perceptron algorithm) have been tested using spanning trees 
generated in various ways, including committees of spanning trees aggregated by majority votes. 
In a nutshell, our experimental comparison shows that predictors based on our online algorithm 
compare well to all baselines while being very efficient in most cases. 

The paper is organized as follows. Next, we recall preliminaries and introduce our basic notation.
Section 2 surveys related work in the literature. In Section 3 we prove the general lower bound
relating the mistakes of any prediction algorithm to the expected cutsize of a random spanning
tree of the weighted graph. In the subsequent section, we present our prediction algorithm WTA
(Weighted Tree Algorithm), along with a detailed mistake bound analysis restricted to weighted
trees. This analysis is extended to weighted graphs in Section 5, where we provide an upper bound
matching the lower bound up to log factors on any sufficiently connected graph. In Section 6
we quantify the robustness of our algorithm to label perturbation. In Section 7 we provide the
constant amortized time implementation of WTA. Based on this implementation, in Section 8 we
present the experimental results. Section 9 is devoted to concluding remarks.

1.1 Preliminaries and Basic Notation 

Let $G = (V, E, W)$ be an undirected, connected, and weighted graph with $n$ nodes and positive
edge weights $w_{i,j} > 0$ for $(i, j) \in E$. A labeling of $G$ is any assignment $y = (y_1, \ldots, y_n) \in
\{-1, +1\}^n$ of binary labels to its nodes. We use $(G, y)$ to denote the resulting labeled weighted
graph.

The online learning protocol for predicting $(G, y)$ can be defined as the following game between
a (possibly randomized) learner and an adversary. The game is parameterized by the graph
$G = (V, E, W)$. Preliminarily, and hidden from the learner, the adversary chooses a labeling $y$ of
$G$. Then the nodes of $G$ are presented to the learner one by one, according to a permutation of $V$,
which is adaptively selected by the adversary. More precisely, at each time step $t = 1, \ldots, n$ the
adversary chooses the next node $i_t$ in the permutation of $V$, and presents it to the learner for the
prediction of the associated label $y_{i_t}$. Then $y_{i_t}$ is revealed, disclosing whether a mistake occurred.
The learner's goal is to minimize the total number of prediction mistakes. Note that while the ad- 
versarial choice of the permutation can depend on the algorithm's randomization, the choice of the 
labeling is oblivious to it. In other words, the learner uses randomization to fend off the adversarial 
choice of labels, whereas it is fully deterministic against the adversarial choice of the permutation. 
The requirement that the adversary is fully oblivious when choosing labels is then dictated by the 
fact that the randomized learners considered in this paper make all their random choices at the 
beginning of the prediction process (i.e., before seeing the labels). 
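
The protocol just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the presentation order is fixed in advance rather than chosen adaptively by the adversary, and `play_online_game` and `majority_learner` are hypothetical names introduced here, not part of the paper.

```python
def play_online_game(nodes, labels, learner, order=None):
    """Run the prediction game once and return the number of mistakes.

    labels:  dict node -> +1/-1, fixed before the game starts (oblivious labeling).
    learner: callable (node, revealed) -> +1/-1, where `revealed` holds the
             labels disclosed so far.
    order:   presentation order; fixed here for simplicity, whereas the
             adversary in the text may select it adaptively.
    """
    order = list(nodes) if order is None else list(order)
    revealed = {}
    mistakes = 0
    for node in order:
        prediction = learner(node, revealed)
        if prediction != labels[node]:
            mistakes += 1
        revealed[node] = labels[node]  # the true label is disclosed after predicting
    return mistakes


def majority_learner(node, revealed):
    """Hypothetical baseline: predict the majority of the labels seen so far."""
    return 1 if sum(revealed.values()) >= 0 else -1


labels = {0: 1, 1: 1, 2: 1, 3: -1, 4: -1}
mistakes = play_online_game(range(5), labels, majority_learner)
```

The baseline learner ignores the graph entirely; the algorithms studied in this paper differ precisely in how they use the graph structure inside `learner`.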

Now, it is reasonable to expect that prediction performance degrades with the increase of "ran- 
domness" in the labeling. For this reason, our analysis of graph prediction algorithms bounds from 
above the number of prediction mistakes in terms of appropriate notions of graph label regularity. 
A standard notion of label regularity is the cutsize of a labeled graph, defined as follows. A $\phi$-edge
of a labeled graph $(G, y)$ is any edge $(i, j)$ such that $y_i \neq y_j$. Similarly, an edge $(i, j)$ is $\phi$-free if
$y_i = y_j$. Let $E^\phi \subseteq E$ be the set of $\phi$-edges in $(G, y)$. The quantity $\Phi_G(y) = |E^\phi|$ is the cutsize of
$(G, y)$, i.e., the number of $\phi$-edges in $E^\phi$ (independent of the edge weights). The weighted cutsize
of $(G, y)$ is defined by

$$\Phi^W_G(y) = \sum_{(i,j) \in E^\phi} w_{i,j} .$$
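
As a small illustration of the two definitions above, the following sketch computes the cutsize and the weighted cutsize of a labeled graph. The edge dictionary format and the function name `cutsizes` are assumptions made for this example.

```python
def cutsizes(edges, y):
    """Return (cutsize, weighted cutsize) of a labeled graph.

    edges: dict (i, j) -> positive weight w_ij (one entry per undirected edge).
    y:     dict node -> +1/-1.
    """
    phi_edges = [(i, j) for (i, j) in edges if y[i] != y[j]]
    return len(phi_edges), sum(edges[e] for e in phi_edges)


edges = {(0, 1): 2.0, (1, 2): 0.5, (2, 3): 1.0, (0, 3): 4.0}
y = {0: 1, 1: 1, 2: -1, 3: -1}
cut, weighted_cut = cutsizes(edges, y)
```

Here the edges (1, 2) and (0, 3) join nodes with disagreeing labels, so the cutsize counts two edges while the weighted cutsize adds up their weights.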

For a fixed $(G, y)$, we denote by $r^W_{i,j}$ the effective resistance between nodes $i$ and $j$ of $G$. In the
interpretation of the graph as an electric network, where the weights $w_{i,j}$ are the edge conductances,
the effective resistance $r^W_{i,j}$ is the voltage between $i$ and $j$ when a unit current flow is maintained
through them. For $(i, j) \in E$, let also $p_{i,j} = w_{i,j} r^W_{i,j}$ be the probability that $(i, j)$ belongs to a
random spanning tree $T$ (see, e.g., the monograph of [22]). Then we have

$$\mathbb{E}\,\Phi_T(y) = \sum_{(i,j) \in E^\phi} p_{i,j} = \sum_{(i,j) \in E^\phi} w_{i,j}\, r^W_{i,j} \qquad (1)$$
where the expectation $\mathbb{E}$ is over the random choice of spanning tree $T$. Observe the natural
weight-scale independence properties of (1). A uniform rescaling of the edge weights $w_{i,j}$ cannot have
an influence on the probabilities $p_{i,j}$, thereby making each product $w_{i,j} r^W_{i,j}$ scale independent. In
addition, since $\sum_{(i,j) \in E} p_{i,j}$ is equal to $n - 1$, irrespective of the edge weighting, we have
$0 \le \mathbb{E}\,\Phi_T(y) \le n - 1$. Hence the ratio $\frac{1}{n-1}\mathbb{E}\,\Phi_T(y) \in [0, 1]$ provides a density-independent measure
of the cutsize in $G$, and even allows us to compare labelings on different graphs.
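
The quantities above can be checked numerically on small graphs. The sketch below is an illustration only, not the paper's implementation: it obtains effective resistances by inverting the Laplacian with one node grounded, using a plain Gauss-Jordan routine, and then forms $p_{i,j} = w_{i,j} r^W_{i,j}$. The function names and the dense-matrix approach are assumptions chosen for readability and are suitable only for tiny graphs.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small dense matrix given as a list of lists."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def inclusion_probabilities(n, edges):
    """edges: dict (i, j) -> weight on nodes 0..n-1 of a connected graph.
    Returns dict (i, j) -> p_ij = w_ij * r_ij."""
    lap = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    # Ground the last node: invert the Laplacian with its row and column removed.
    m = invert([row[:n - 1] for row in lap[:n - 1]])

    def entry(a, b):
        return 0.0 if a == n - 1 or b == n - 1 else m[a][b]

    return {(i, j): w * (entry(i, i) + entry(j, j) - 2.0 * entry(i, j))
            for (i, j), w in edges.items()}


def expected_tree_cutsize(n, edges, y):
    """Sum of p_ij over the edges whose endpoints disagree, as in (1)."""
    p = inclusion_probabilities(n, edges)
    return sum(p_ij for (i, j), p_ij in p.items() if y[i] != y[j])


# A unit-weight triangle 0-1-2 with a heavy pendant bridge 2-3.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 10.0}
p = inclusion_probabilities(4, edges)
y = {0: 1, 1: 1, 2: -1, 3: -1}
expected_cut = expected_tree_cutsize(4, edges, y)
```

In this toy graph the probabilities add up to $n - 1 = 3$, and the bridge has probability one regardless of its weight, in line with the scale-independence remarks above.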

Now contrast $\mathbb{E}\,\Phi_T(y)$ to the more standard weighted cutsize measure $\Phi^W_G(y)$. First, $\Phi^W_G(y)$
is clearly weight-scale dependent. Second, it can be much larger than $n$ on dense graphs, even in
the unweighted $w_{i,j} = 1$ case. Third, it strongly depends on the density of $G$, which is generally
related to $\sum_{(i,j) \in E} w_{i,j}$. In fact, $\mathbb{E}\,\Phi_T(y)$ can be much smaller than $\Phi^W_G(y)$ when there are strongly
connected regions in $G$ contributing prominently to the weighted cutsize. To see this, consider the
following scenario: If $(i, j) \in E^\phi$ and $w_{i,j}$ is large, then $(i, j)$ gives a big contribution to $\Phi^W_G(y)$.^2
However, this does not necessarily happen with $\mathbb{E}\,\Phi_T(y)$. In fact, if $i$ and $j$ are strongly connected
(i.e., if there are many disjoint paths connecting them), then $r^W_{i,j}$ is very small and so are the terms
$w_{i,j} r^W_{i,j}$ in (1). Therefore, the effect of the large weight $w_{i,j}$ may often be compensated by the small
probability of including $(i, j)$ in the random spanning tree. See Figure 1 for an example.

A different way of taking into account graph connectivity is provided by the covering ball
approach taken by [14, 15] (see the next section).




Figure 1: A barbell graph. The weight of the two thick black edges is equal to $\sqrt{|V|}$, all the other
edges have unit weight. If the two labels $y_1$ and $y_2$ are such that $y_1 \neq y_2$, then the contribution
of the edges on the left clique $C_1$ to the cutsizes $\Phi_G(y)$ and $\Phi^W_G(y)$ must be large. However,
since the probability of including each edge of $C_1$ in a random spanning tree $T$ is $O(1/|V|)$,
$C_1$'s contribution to $\mathbb{E}\,\Phi_T(y)$ is $|V|$ times smaller than $\Phi_{C_1}(y) = \Phi^W_{C_1}(y)$. If $y_3 \neq y_4$, then the
contribution of edge $(3, 4)$ to $\Phi^W_G(y)$ is large. Because this edge is a bridge, the probability of
including it in $T$ is one, independent of $w_{3,4}$. Indeed, we have $p_{3,4} = w_{3,4}\, r^W_{3,4} = w_{3,4}/w_{3,4} = 1$.
If $y_5 \neq y_6$, then the contribution of the right clique $C_2$ to $\Phi^W_G(y)$ is large. On the other hand, the
probability of including edge $(5, 6)$ in $T$ is equal to $p_{5,6} = w_{5,6}\, r^W_{5,6} = O(1/\sqrt{|V|})$. Hence, the
contribution of $(5, 6)$ to $\mathbb{E}\,\Phi_T(y)$ is small because the large weight of $(5, 6)$ is offset by the fact that
nodes 5 and 6 are strongly connected (i.e., there are many different paths among them). Finally,
note that $p_{i,j} = O(1/|V|)$ holds for all edges $(i, j)$ in $C_2$, implying (similar to clique $C_1$) that $C_2$'s
contribution to $\mathbb{E}\,\Phi_T(y)$ is $|V|$ times smaller than $\Phi^W_{C_2}(y)$.



2 Related Work 

With the above notation and preliminaries in hand, we now briefly survey the results in the existing
literature which are most closely related to this paper. Further comments are made at the end of
Section 5.

Standard online linear learners, such as the Perceptron algorithm, are applied to the general
(weighted) graph prediction problem by embedding the $n$ vertices of the graph in $\mathbb{R}^n$ through a
map $i \mapsto K^{-1/2} e_i$, where $e_i \in \mathbb{R}^n$ is the $i$-th vector in the canonical basis of $\mathbb{R}^n$, and $K$ is a
positive definite $n \times n$ matrix. The graph Perceptron algorithm [11, 16] uses $K = L_G + \mathbf{1}\mathbf{1}^\top$,
where $L_G$ is the (weighted) Laplacian of $G$ and $\mathbf{1} = (1, \ldots, 1)$. The resulting mistake bound is of
the form $\Phi^W_G(y) D^W_G$, where $D^W_G = \max_{i,j} r^W_{i,j}$ is the resistance diameter of $G$. As expected, this
bound is weight-scale independent, but the interplay between the two factors in it may lead to a
vacuous result. At a given scale for the weights $w_{i,j}$, if $G$ is dense, then we may have $D^W_G = O(1)$
while $\Phi^W_G(y)$ is of the order of $n^2$. If $G$ is sparse, then $\Phi^W_G(y) = O(n)$ but then $D^W_G$ may become
as large as $n$.

^2 It is easy to see that in such cases $\Phi^W_G(y)$ can be much larger than $n$.

The idea of using a spanning tree to reduce the cutsize of $G$ has been investigated by [19], where
the graph Perceptron algorithm is applied to a spanning tree $T$ of $G$. The resulting mistake bound is
of the form $\Phi^W_T(y) D^W_T$, i.e., the graph Perceptron bound applied to tree $T$. Since $\Phi^W_T(y) \le \Phi^W_G(y)$,
this bound has a smaller cutsize than the previous one. On the other hand, $D^W_T$ can be much larger
than $D^W_G$ because removing edges may increase the resistance. Hence the two bounds are generally
incomparable.

[19] suggest applying the graph Perceptron algorithm to the spanning tree $T$ with smallest
geodesic diameter. The geodesic diameter of a weighted graph $G$ is defined by

$$\Delta^W_G = \max_{i,j} \min_{\Pi_{i,j}} \sum_{(r,s) \in \Pi_{i,j}} \frac{1}{w_{r,s}}$$

where the minimum is over all paths $\Pi_{i,j}$ between $i$ and $j$. The reason behind this choice of $T$ is
that, for the spanning tree $T$ with smallest geodesic diameter, it holds that $D^W_T \le 2\Delta^W_G$. However,
on the one hand $D^W_G \le \Delta^W_G$, so there is no guarantee that $D^W_T = O(D^W_G)$, and on the other hand
the adversary may still concentrate all $\phi$-edges on the chosen tree $T$, so there is no guarantee that
$\Phi^W_T(y)$ remains small either.
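
To make the definition of the geodesic diameter concrete, here is a minimal sketch that treats each resistor $1/w_{r,s}$ as an edge length and runs an all-pairs shortest path computation (Floyd-Warshall). The function name and the input format are assumptions of this example, and the cubic running time makes it suitable for small graphs only.

```python
def geodesic_diameter(n, edges):
    """Largest shortest-path distance when each edge (r, s) has length 1 / w_rs.

    edges: dict (i, j) -> positive weight on nodes 0..n-1 of a connected graph.
    """
    inf = float("inf")
    dist = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for (i, j), w in edges.items():
        dist[i][j] = dist[j][i] = min(dist[i][j], 1.0 / w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return max(max(row) for row in dist)


# Going from 0 to 2 through node 1 (0.5 + 0.25) is shorter than the direct edge (1.0).
edges = {(0, 1): 2.0, (1, 2): 4.0, (0, 2): 1.0}
diameter = geodesic_diameter(3, edges)
```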

[18] introduce a different technique, showing its application to the case of unweighted graphs.
After reducing the graph to a spanning tree $T$, the tree is linearized via a depth-first visit. This
gives a line graph $S$ (the so-called spine of $G$) such that $\Phi_S(y) \le 2\,\Phi_T(y)$. By running a Nearest
Neighbor (NN) predictor on $S$, [18] prove a mistake bound of the form $\Phi_S(y) \log\bigl(n/\Phi_S(y)\bigr) +
\Phi_S(y)$. As observed by [11], similar techniques have been developed to solve low-congestion
routing problems.

Another natural parametrization for the labels of a weighted graph that takes the graph structure
into account is clusterability, i.e., the extent to which the graph nodes can be covered by a few
balls of small resistance diameter. With this inductive bias in mind, [14] developed the Pounce
algorithm, which can be seen as a combination of graph Perceptron and NN prediction. The
number of mistakes has a bound of the form

$$\min_{\rho > 0} \bigl( \mathcal{N}(G, \rho) + \Phi^W_G(y)\, \rho \bigr) \qquad (2)$$

where $\mathcal{N}(G, \rho)$ is the smallest number of balls of resistance diameter $\rho$ it takes to cover the nodes
of $G$. Note that the graph Perceptron bound is recovered when $\rho = D^W_G$. Moreover, observe that,
unlike graph Perceptron's, bound (2) is never vacuous, as it holds uniformly for all covers of $G$
(even the one made up of singletons, corresponding to $\rho \to 0$). A further trick for the unweighted
case proposed by [18] is to take advantage of both previous approaches (graph Perceptron and NN
on line graphs) by building a binary tree on $G$. This "support tree" helps in keeping the diameter
of $G$ as small as possible, e.g., logarithmic in the number of nodes $n$. The resulting prediction
algorithm is again a combination of a Perceptron-like algorithm and NN, and the corresponding
number of mistakes is the minimum over two earlier bounds: a NN-based bound of the form
$\Phi_G(y)(\log n)^2$ and an unweighted version of bound (2).

Generally speaking, clusterability and resistance-weighted cutsize $\mathbb{E}\,\Phi_T(y)$ exploit the graph
structure in different ways. Consider, for instance, a barbell graph made up of two $m$-cliques
joined by $k$ unweighted $\phi$-edges with no endpoints in common (hence $k \le m$).^3 If $m$ is much
larger than $k$, then bound (2) scales linearly with $k$ (the two balls in the cover correspond to the
two $m$-cliques). On the other hand, $\mathbb{E}\,\Phi_T(y)$ tends to be constant: Because $m$ is much larger
than $k$, the probability of including any $\phi$-edge in $T$ tends to $1/k$, as $m$ increases and $k$ stays
constant. On the other hand, if $k$ gets close to $m$ the resistance diameter of the graph decreases,
and (2) becomes a constant. In fact, one can show that when $k = m$ even $\mathbb{E}\,\Phi_T(y)$ is a constant,
independent of $m$. In particular, the probability that a $\phi$-edge is included in the random spanning
tree $T$ is upper bounded by $\frac{3m-1}{m(m+1)}$, i.e., $\mathbb{E}\,\Phi_T(y) \to 3$ when $m$ grows large.^4

When the graph at hand has a large diameter, e.g., an $m$-line graph connected to an $m$-clique
(this is sometimes called a "lollipop" graph), the gap between the covering-based bound (2) and
$\mathbb{E}\,\Phi_T(y)$ is magnified. Yet, it is fair to say that the bounds we are about to prove for our algorithm
have an extra factor, beyond $\mathbb{E}\,\Phi_T(y)$, which is logarithmic in $m$. A similar logarithmic factor is
achieved by the combined algorithm proposed in [18].

An even more refined way of exploiting cluster structure and connectivity in graphs is contained
in the paper of [15], where the authors provide a comprehensive study of the application of
dual-norm techniques to the prediction of weighted graphs, again with the goal of obtaining logarithmic
performance guarantees on large diameter graphs. In order to trade off the contribution of cutsize
$\Phi^W_G(y)$ and resistance diameter $D^W_G$, the authors develop a notion of $p$-norm resistance. The obtained
bounds are dual norm versions of the covering ball bound (2). Roughly speaking, one can select
the dual norm parameter of the algorithm to obtain a logarithmic contribution from the resistance
diameter at the cost of squaring the contribution due to the cutsize. This quadratic term can be
further reduced if the graph is well connected. For instance, in the unweighted barbell graph
mentioned above, selecting the norm appropriately leads to a bound which is constant even when
$k \ll m$.

Further comments on the comparison between the results presented by [15] and the ones in our
paper are postponed to the end of Section 5.

Departing from the online learning scenario, it is worth mentioning the large body of literature
on the general problem of learning the nodes of a graph in the train/test transductive setting.
Many algorithms have been proposed, including the label-consistent mincut approach of [4, 5]
and a number of other "energy minimization" methods, e.g., the ones by [31, 2], of which label
propagation is an instance. See the work of [3] for a relatively recent survey on this subject.

Our graph prediction algorithm is based on a random spanning tree of the original graph. The 
problem of drawing a random spanning tree of an arbitrary graph has a long history — see, e.g., the 
recent monograph by [22]. In the unweighted case, a random spanning tree can be sampled with a
random walk in expected time $O(n \ln n)$ for "most" graphs, as shown by jH. Using the beautiful
algorithm of [30], the expected time reduces to $O(n)$; see also the work of [[Q. However, all
known techniques take expected time Q(n^) on certain pathological graphs. In the weighted case,
the above methods can take longer due to the hardness of reaching, via a random walk, portions
of the graph which are connected only via light-weighted edges. To sidestep this issue, in our

^3 This is one of the examples considered in [15].
^4 This can be shown by computing the effective resistance of a $\phi$-edge $(i, j)$ as the minimum, over all unit-strength
flow functions with $i$ as source and $j$ as sink, of the squared flow values summed over all edges; see, e.g., [22].


Figure 2: The adversarial strategy. Numbers on edges are the probabilities $p_{i,j}$ of those edges
being included in a random spanning tree for the weighted graph under consideration. Numbers
within nodes denote the weight of that node based on the $p_{i,j}$ (see main text). We set the budget
$K$ to 6, hence the subset $S$ contains the 6 nodes having smallest weight. The adversary assigns a
random label to each node in $S$, thus forcing $|S|/2$ mistakes in expectation. Then, it labels all nodes
in $V \setminus S$ with a unique label, chosen in such a way as to minimize the cutsize consistent with the
labels previously assigned to the nodes of $S$.

experiments we tested a viable fast approximation where weights are disregarded when building 
the spanning tree, and only used at prediction time. Finally, the space complexity for generating a 
random spanning tree is always linear in the graph size. 
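
For illustration, a random-walk sampler of the kind mentioned above can be sketched as follows. This is a sketch in the style of the Aldous-Broder procedure, not necessarily the sampler used in our experiments: the walk moves to a neighbor with probability proportional to the edge weight, and the first edge used to enter each node is kept. The adjacency format and the function name are assumptions of this example.

```python
import random


def random_spanning_tree(adj, rng=random):
    """Sample a spanning tree with a random walk.

    adj: dict node -> dict neighbor -> positive weight (connected, symmetric).
    rng: object providing choice() and choices(), e.g. random.Random(seed).
    Returns a list of (parent, child) tree edges.
    """
    nodes = list(adj)
    current = rng.choice(nodes)
    visited = {current}
    tree = []
    while len(visited) < len(nodes):
        neighbors = list(adj[current])
        weights = [adj[current][v] for v in neighbors]
        nxt = rng.choices(neighbors, weights=weights, k=1)[0]
        if nxt not in visited:
            # First entrance into nxt: keep the edge that was used to reach it.
            visited.add(nxt)
            tree.append((current, nxt))
        current = nxt
    return tree


adj = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 5.0},
    2: {0: 1.0, 1: 5.0, 3: 2.0},
    3: {2: 2.0},
}
tree = random_spanning_tree(adj, random.Random(0))
```

Replacing the weighted step by a uniform choice among neighbors gives the kind of approximation discussed above, where the weights are disregarded while building the tree and used only at prediction time.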

To conclude this section, it is worth mentioning that, although we exploit random spanning 
trees to reduce the cutsize, similar approaches can also be used to approximate the cutsize of a 
weighted graph by sparsification (see, e.g., the work of [26]). However, because the resulting
graphs are not as sparse as spanning trees, we do not currently see how to use those results. 

3 A General Lower Bound 

This section contains our general lower bound. We show that any prediction algorithm must err at
least $\frac{1}{2}\mathbb{E}\,\Phi_T(y)$ times on any weighted graph.

Theorem 1. Let $G = (V, E, W)$ be a weighted undirected graph with $n$ nodes and weights $w_{i,j} > 0$
for $(i, j) \in E$. Then for all $K \le n$ there exists a randomized labeling $y$ of $G$ such that for all
(deterministic or randomized) algorithms $A$, the expected number of prediction mistakes made by
$A$ is at least $K/2$, while $\mathbb{E}\,\Phi_T(y) < K$.

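Proof. The adversary uses the weighting $P$ induced by $W$ and defined by $p_{i,j} = w_{i,j} r^W_{i,j}$. By (1),
$p_{i,j}$ is the probability that edge $(i, j)$ belongs to a random spanning tree $T$ of $G$. Let $P_i = \sum_j p_{i,j}$
be the sum over the induced weights of all edges incident to node $i$. We call $P_i$ the weight of node
$i$. Let $S \subseteq V$ be the set of $K$ nodes $i$ in $G$ having the smallest weight $P_i$. The adversary assigns a
random label to each node $i \in S$. This guarantees that, no matter what, the algorithm $A$ will make
on average $K/2$ mistakes on the nodes in $S$. The labels of the remaining nodes in $V \setminus S$ are set
either all $+1$ or all $-1$, depending on which one of the two choices yields the smaller $\Phi^P_G(y)$. See
Figure 2 for an illustrative example. We now show that the weighted cutsize $\Phi^P_G(y)$ of this labeling
$y$ is less than $K$, independent of the labels of the nodes in $S$.

Since the nodes in $V \setminus S$ all have the same label, the $\phi$-edges induced by this labeling can only
connect either two nodes in $S$ or one node in $S$ and one node in $V \setminus S$. Hence $\Phi^P_G(y)$ can be written
as

$$\Phi^P_G(y) = \Phi^{P,\mathrm{int}}_G(y) + \Phi^{P,\mathrm{ext}}_G(y)$$

where $\Phi^{P,\mathrm{int}}_G(y)$ is the cutsize contribution within $S$, and $\Phi^{P,\mathrm{ext}}_G(y)$ is the one from edges between
$S$ and $V \setminus S$. We can now bound these two terms by combining the definition of $S$ with the equality
$\sum_{(i,j) \in E} p_{i,j} = n - 1$ as in the sequel. Let

$$P^{\mathrm{int}}_S = \sum_{(i,j) \in E \,:\, i,j \in S} p_{i,j} \qquad \text{and} \qquad P^{\mathrm{ext}}_S = \sum_{(i,j) \in E \,:\, i \in S,\ j \in V \setminus S} p_{i,j} .$$

From the very definition of $P^{\mathrm{int}}_S$ and $\Phi^{P,\mathrm{int}}_G(y)$ we have $\Phi^{P,\mathrm{int}}_G(y) \le P^{\mathrm{int}}_S$. Moreover, from the way
the labels of nodes in $V \setminus S$ are selected, it follows that $\Phi^{P,\mathrm{ext}}_G(y) \le P^{\mathrm{ext}}_S / 2$. Finally,

$$\sum_{i \in S} P_i = 2 P^{\mathrm{int}}_S + P^{\mathrm{ext}}_S$$

holds, since each edge connecting nodes in $S$ is counted twice in the sum $\sum_{i \in S} P_i$. Putting
everything together we obtain

$$2 P^{\mathrm{int}}_S + P^{\mathrm{ext}}_S = \sum_{i \in S} P_i \le \frac{K}{n} \sum_{i \in V} P_i = \frac{2K}{n} \sum_{(i,j) \in E} p_{i,j} = \frac{2K(n-1)}{n}$$

the inequality following from the definition of $S$. Hence

$$\mathbb{E}\,\Phi_T(y) = \Phi^P_G(y) = \Phi^{P,\mathrm{int}}_G(y) + \Phi^{P,\mathrm{ext}}_G(y) \le P^{\mathrm{int}}_S + \frac{P^{\mathrm{ext}}_S}{2} \le \frac{K(n-1)}{n} < K$$

concluding the proof. □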
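
The adversarial construction used in the proof can be sketched as follows. This is an illustrative sketch under the assumption that the inclusion probabilities $p_{i,j}$ are already available as a dictionary; the function name and the return format are hypothetical.

```python
import random


def adversarial_labeling(nodes, p, budget, rng=random):
    """Build the labeling used in the lower bound.

    nodes:  iterable of node identifiers.
    p:      dict (i, j) -> probability p_ij that edge (i, j) is in a random tree.
    budget: the number K of nodes receiving a random label.
    Returns (labels, cut, s): the labeling, its P-weighted cutsize, and the set S.
    """
    nodes = list(nodes)
    node_weight = {v: 0.0 for v in nodes}
    for (i, j), p_ij in p.items():
        node_weight[i] += p_ij
        node_weight[j] += p_ij
    # S: the K nodes of smallest weight P_i, each labeled uniformly at random.
    s = set(sorted(nodes, key=lambda v: node_weight[v])[:budget])
    random_part = {v: rng.choice((-1, 1)) for v in s}

    def cut_with(common):
        full = {v: random_part.get(v, common) for v in nodes}
        value = sum(q for (i, j), q in p.items() if full[i] != full[j])
        return value, full

    # The nodes outside S share one label, chosen to give the smaller cutsize.
    cut, labels = min((cut_with(+1), cut_with(-1)), key=lambda pair: pair[0])
    return labels, cut, s


# On a tree every edge has p_ij = 1; here a path with n = 6 nodes and K = 2.
n, budget = 6, 2
p = {(k, k + 1): 1.0 for k in range(n - 1)}
labels, cut, s = adversarial_labeling(range(n), p, budget, random.Random(1))
```

On this path the two endpoints have the smallest weight and form $S$, and the resulting cutsize stays below $K(n-1)/n$, as the chain of inequalities in the proof requires.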



4 The Weighted Tree Algorithm 

We now describe the Weighted Tree Algorithm (WTA) for predicting the labels of a weighted tree. 
In Section 5 we show how to apply WTA to the more general weighted graph prediction problem.
WTA first transforms the tree into a line graph (i.e., a list), then runs a fast nearest neighbor method
to predict the labels of each node in the line. Though this technique is similar to the one used
by [18], the fact that the tree is weighted makes the analysis significantly more difficult, and the
practical scope of our algorithm significantly wider. Our experimental comparison in Section 8
confirms that exploiting the weight information is often beneficial in real-world graph prediction
problems.

Given a labeled weighted tree (T, y), the algorithm initially creates a weighted line graph L' 
containing some duplicates of the nodes in T. Then, each duplicate node (together with its incident 
edges) is replaced by a single edge with a suitably chosen weight. This results in the final weighted 
line graph L which is then used for prediction. In order to create L from T, WTA performs the 
following tree linearization steps:

Figure 3: Top: A weighted graph G with 9 nodes. Initially, WTA extracts a random spanning tree
T out of G. The weights on the edges in T are the same as those of G. Middle: The spanning tree 

T is linearized through a depth-first traversal starting from an arbitrary node (node 2 in this figure). 
For simplicity, we assume the traversal visits the siblings from left to right. As soon as a node 
is visited it gets stored in a line graph L' (first line graph from top). Backtracking steps produce 
duplicates in L' of some of the nodes in T. For instance, node 7 is the first node to be duplicated 
when the visit backtracks from node 8. The duplicated nodes are progressively eliminated from L' 
in the order of their insertion in L'. Several iterations of this node elimination process are displayed 
from the top to the bottom, showing how L' is progressively shrunk to the final line L (bottom line). 
Each line represents the elimination of a single duplicated node. The crossed nodes in each line 
are the nodes which are scheduled to be eliminated. Each time a new node j is eliminated, its two 
adjacent nodes i and k are connected by the lighter of the two edges (i, j) and (j, k). For instance: 
the left-most duplicated 7 is dropped by directly connecting the two adjacent nodes 8 and 1 by an 
edge with weight 1/2; the right-most node 2 is eliminated by directly connecting node 6 to node 9 
with an edge with weight 1/2, and so on. Observe that this elimination procedure can be carried 
out in any order without changing the resulting list L. Bottom: We show WTA's prediction on the 
line L so obtained. In this figure, the numbers above the edges denote the edge weights, the ones
below are the resistors, i.e., weight reciprocals. We are at time step t = 3 where two labels have
so far been revealed (gray nodes). WTA predicts on the remaining nodes according to a nearest
neighbor rule on L, based on the resistance distance metric. All possible predictions made by WTA
at this time step are shown.

1. An arbitrary node $r$ of $T$ is chosen, and a line $L'$ containing only $r$ is created.

2. Starting from $r$, a depth-first visit of $T$ is performed. Each time an edge $(i, j)$ is traversed
(even in a backtracking step) from $i$ to $j$, the edge is appended to $L'$ with its weight $w_{i,j}$, and
$j$ becomes the current terminal node of $L'$. Note that backtracking steps can create in $L'$ at
most one duplicate of each edge in $T$, while nodes in $T$ may be duplicated several times in
$L'$.

3. $L'$ is traversed once, starting from terminal $r$. During this traversal, duplicate nodes are
eliminated as soon as they are encountered. This works as follows. Let $j$ be a duplicate
node, and $(j', j)$ and $(j, j'')$ be the two incident edges. The two edges are replaced by a new
edge $(j', j'')$ having weight $w_{j',j''} = \min\{w_{j',j}, w_{j,j''}\}$.^5 Let $L$ be the resulting line.



The analysis of Section 4.1 shows that this choice of $w_{j',j''}$ guarantees that the weighted cutsize of
$L$ is smaller than twice the weighted cutsize of $T$.
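
The three linearization steps can be sketched as follows. This is a minimal illustration, not the constant amortized time implementation of Section 7. It assumes that the first occurrence of each node in $L'$ is the one that is kept, and that duplicates left at the very end of $L'$ (produced by the final backtracking towards $r$) are simply dropped; the function name and the adjacency format are hypothetical.

```python
def linearize_tree(tree, root):
    """Turn a weighted tree into a weighted line graph.

    tree: dict node -> dict neighbor -> weight (undirected, stored symmetrically).
    Returns (line, line_w), where line_w[k] is the weight of the edge joining
    line[k] and line[k + 1].
    """
    # Steps 1-2: depth-first visit; every traversed edge is appended to L'.
    # Recursion is used for brevity, so very deep trees would need an iterative visit.
    walk, walk_w = [root], []

    def visit(u, parent):
        for v, w in tree[u].items():
            if v == parent:
                continue
            walk.append(v)      # forward step u -> v
            walk_w.append(w)
            visit(v, u)
            walk.append(u)      # backtracking step v -> u
            walk_w.append(w)

    visit(root, None)

    # Step 3: drop duplicates; a run of eliminated nodes is bridged by the
    # lightest edge met since the last node that was kept.
    line, line_w, seen = [root], [], {root}
    lightest = float("inf")
    for node, w in zip(walk[1:], walk_w):
        lightest = min(lightest, w)
        if node not in seen:
            seen.add(node)
            line.append(node)
            line_w.append(lightest)
            lightest = float("inf")
    return line, line_w


tree = {
    0: {1: 3.0, 2: 1.0},
    1: {0: 3.0, 3: 2.0},
    2: {0: 1.0},
    3: {1: 2.0},
}
line, line_w = linearize_tree(tree, 0)
```

In this toy tree the visit backtracks from node 3 through nodes 1 and 0 before reaching node 2, so nodes 3 and 2 end up adjacent in the line, joined by the lightest of the three edges traversed in between.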

Once $L$ is created from $T$, the algorithm predicts the label of each node $i_t$ using a
nearest-neighbor rule operating on $L$ with a resistance distance metric. That is, the prediction on $i_t$ is
the label of $i_{s^*}$, where $s^* = \arg\min_{s < t} d(i_s, i_t)$ is the previously revealed node closest to $i_t$, and
$d(i, j) = \sum_{s=1}^{k} 1/w_{v_s, v_{s+1}}$ is the sum of the resistors (i.e., reciprocals of edge weights) along the
unique path $i = v_1 \to v_2 \to \cdots \to v_{k+1} = j$ connecting node $i$ to node $j$. Figure 3 gives an
example of WTA at work.
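
The nearest-neighbor rule on the line can be sketched as follows. This naive version recomputes distances at every call and is meant only to illustrate the rule; it is not the constant amortized time implementation of Section 7. The default prediction before any label is revealed and the tie-breaking are arbitrary assumptions of this example.

```python
def wta_predict(line_w, position, revealed):
    """Predict the label at `position` on a weighted line.

    line_w:   line_w[k] is the weight of the edge between positions k and k + 1.
    revealed: dict position -> +1/-1 for the labels disclosed so far.
    """
    # Prefix sums of resistors: d(a, b) = |prefix[a] - prefix[b]|.
    prefix = [0.0]
    for w in line_w:
        prefix.append(prefix[-1] + 1.0 / w)
    if not revealed:
        return 1  # assumption: arbitrary default when nothing has been revealed
    nearest = min(revealed, key=lambda s: abs(prefix[s] - prefix[position]))
    return revealed[nearest]


line_w = [3.0, 2.0, 1.0]          # resistors 1/3, 1/2, 1
revealed = {0: 1, 3: -1}
prediction = wta_predict(line_w, 2, revealed)
```

Position 2 is one hop away from position 3 and two hops away from position 0, yet its resistance distance to position 0 (1/3 + 1/2) is smaller than its distance to position 3 (1), so the prediction follows the label revealed at position 0.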

4.1 Analysis of WTA 

The following lemma gives a mistake bound on WTA run on any weighted line graph. Given any
labeled graph $(G, y)$, we denote by $R^W_G$ the sum of resistors of $\phi$-free edges in $G$,

$$R^W_G = \sum_{(i,j) \in E \setminus E^\phi} \frac{1}{w_{i,j}} .$$

Also, given any $\phi$-free edge subset $E' \subseteq E \setminus E^\phi$, we define $R^W_G(\neg E')$ as the sum of the resistors
of all $\phi$-free edges in $E \setminus (E^\phi \cup E')$,

$$R^W_G(\neg E') = \sum_{(i,j) \in E \setminus (E^\phi \cup E')} \frac{1}{w_{i,j}} .$$

Note that $R^W_G(\neg E') \le R^W_G$, since we drop some edges from the sum in the defining formula.

Finally, we use $f \overset{O}{=} g$ as shorthand for $f = O(g)$. The following lemma is the starting point of
our theoretical investigation; please see Appendix A for proofs.

Lemma 2. If WTA is run on a labeled weighted line graph $(L, y)$, then the total number $m_L$ of
mistakes satisfies



^5 By iterating this elimination procedure, it might happen that more than two adjacent nodes get eliminated. In this
case, the two surviving terminal nodes are connected in $L$ by the lightest edge among the eliminated ones in $L'$.
bound the contribution of any set of $K$ resistors in $R^W_L$ at the cost of adding $K$ extra mistakes.
We now provide an upper bound on the number of mistakes made by WTA on any weighted tree
$T = (V, E, W)$ in terms of the number of $\phi$-edges, the weighted cutsize, and $R^W_T$.

Theorem 3. If WTA is run on a labeled weighted tree (T, y), then the total number $m_T$ of mistakes 
satisfies 



for all subsets $E'$ of $E \setminus E^\phi$. 

The logarithmic factor in the above bound shows that the algorithm takes advantage of labelings 
such that the weights of $\phi$-edges are small (thus making $\Phi^W_T(y)$ small) and the weights of $\phi$-free 
edges are high (thus making $R^W_T$ small). This matches the intuition behind WTA's nearest-neighbor 
rule according to which nodes that are close to each other are expected to have the same label. In 
particular, observe that the way the above quantities are combined makes the bound independent 
of rescaling of the edge weights. Again, this has to be expected, since WTA's prediction is scale 
insensitive. On the other hand, it may appear less natural that the mistake bound also depends 
linearly on the cutsize $\Phi_T(y)$, independent of the edge weights. The specialization to trees of 
our lower bound (Theorem 1 in Section 3) implies that this linear dependence of mistakes on the 
unweighted cutsize is necessary whenever the adversarial labeling is chosen from a set of labelings 
with bounded $\Phi_T(y)$. 

5 Predicting a Weighted Graph 

In order to solve the more general problem of predicting the labels of a weighted graph G, one can 
first generate a spanning tree T of G and then run WTA directly on T. In this case, it is possible 
to rephrase Theorem 3 in terms of the properties of G. Note that for each spanning tree T of G, 
$\Phi^W_T(y) \le \Phi^W_G(y)$ and $\Phi_T(y) \le \Phi_G(y)$. Specific choices of the spanning tree T control in different 
ways the quantities in the mistake bound of Theorem 3. For example, a minimum spanning tree 
tends to reduce the value of $R^W_T$, betting on the fact that $\phi$-edges are light. The next theorem relies 
on random spanning trees. 

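Theorem 4. If WTA is run on a random spanning tree T of a labeled weighted graph (G, y), then 
the total number $m_G$ of mistakes satisfies 

$$\mathbb{E}\, m_G \stackrel{\mathcal{O}}{=} \mathbb{E}\bigl[\Phi_T(y)\bigr] \Bigl(1 + \log\bigl(1 + w^\phi_{\max}\, \mathbb{E}\bigl[R^W_T\bigr]\bigr)\Bigr) \qquad (3)$$

where $w^\phi_{\max} = \max_{(i,j) \in E^\phi} w_{i,j}$. 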
We now compare the mistake bound (3) to the lower bound stated in Theorem 1. In particular, 
we prove that WTA is optimal (up to log n factors) on every weighted connected graph in which the 
$\phi$-edge weights are not "superpolynomially overloaded" w.r.t. the $\phi$-free edge weights. In order 
to rule out pathological cases, when the weighted graph is nearly disconnected, we impose the 
following mild assumption on the graphs being considered. 

We say that a graph is polynomially connected if the ratio of any pair of effective resistances 
(even those between nonadjacent nodes) in the graph is polynomial in the total number of nodes 
n. This definition essentially states that a weighted graph can be considered connected if no pair 
of nodes can be found which is substantially less connected than any other pair of nodes. Again, 
as one would naturally expect, this definition is independent of uniform weight rescaling. The 
following corollary shows that if WTA is not optimal on a polynomially connected graph, then the 
labeling must be so irregular that the total weight of $\phi$-edges is an overwhelming fraction of the 
overall weight. 

Corollary 5. Pick any polynomially connected weighted graph G with n nodes. If the ratio of the 
total weight of $\phi$-edges to the total weight of $\phi$-free edges is bounded by a polynomial in n, then the 
total number of mistakes $m_G$ made by WTA when run on a random spanning tree T of G satisfies 

$$\mathbb{E}\, m_G \stackrel{\mathcal{O}}{=} \mathbb{E}\bigl[\Phi_T(y)\bigr] \log n.$$

Note that when the hypothesis of this corollary is not satisfied the bound of WTA is not 
necessarily vacuous. For example, $\mathbb{E}[R^W_T]\, w^\phi_{\max} = n^{\mathrm{polylog}(n)}$ implies an upper bound which is optimal 
up to polylog(n) factors. In particular, having a constant number of $\phi$-free edges with 
exponentially large resistance contradicts the assumption of polynomial connectivity, but it need not lead 
to a vacuous bound in Theorem 4. In fact, one can use Lemma 2 to drop from the mistake bound of 
Theorem 4 the contribution of any set of $\mathcal{O}(1)$ resistances in $\mathbb{E}[R^W_T] = \sum_{(i,j) \in E \setminus E^\phi} r^W_{i,j}$ at the cost 
of adding just $\mathcal{O}(1)$ extra mistakes. This could be seen as a robustness property of WTA's bound 
against graphs that do not fully satisfy the connectedness assumption. 

We further elaborate on the robustness properties of WTA in Section 6. In the meanwhile, note 
how Corollary 5 compares to the expected mistake bound of algorithms like graph Perceptron (see 
Section 2) on the same random spanning tree. This bound depends on the expectation of the 
product $\Phi^W_T(y)\, D^W_T$, where $D^W_T$ is the diameter of T in the resistance distance metric. Recall from 
the discussion in Section 2 that these two factors are negatively correlated because $\Phi^W_T(y)$ 
depends linearly on the edge weights, while $D^W_T$ depends linearly on the reciprocal of these weights. 
Moreover, for any given scale of the edge weights, $D^W_T$ can be linear in the number n of nodes. 

Another interesting comparison is to the covering ball bounds of [14, 15]. Consider the case 
when G is an unweighted tree with diameter D. Whereas the dual norm approach of [15] gives 
a mistake bound of the form $\Phi_G(y)^2 \log D$, our approach, as well as the one by [18], yields 
$\Phi_G(y) \log n$. Namely, the dependence on $\Phi_G(y)$ becomes linear rather than quadratic, but the 
diameter D gets replaced by n, the number of nodes in G. Replacing n by D seems to be a 
benefit brought by the covering ball approach. More generally, one can say that the covering ball 
approach seems to allow replacing the extra log n term contained in Corollary 5 by more refined 
structural parameters of the graph (like its diameter D), but it does so at the cost of squaring the 
dependence on the cutsize. A typical (and unsurprising) example where the dual-norm covering 

* As a matter of fact, a bound of the form $\Phi_G(y) \log D$ on unweighted trees is also achieved by the direct analysis 
ofQ. 
ball bounds are better than the one in Corollary 5 is when the labeled graph is well-clustered. One 
such example we already mentioned in Section 2. On the unweighted barbell graph made up of 
m-cliques connected by $k \ll m$ $\phi$-edges, the algorithm of [15] has a constant bound on the 
number of mistakes (i.e., independent of both m and k), the Pounce algorithm has a linear bound in 
k, while Corollary 5 delivers a logarithmic bound in m + k. Yet, it is fair to point out that the 
bounds of [14, 15] refer to computationally heavier algorithms than WTA: Pounce has a 
deterministic initialization step that computes the inverse Laplacian matrix of the graph (this is cubic in n, 
or quadratic in the case of trees), while the minimum $(\Psi, p)$-seminorm interpolation algorithm of [15] 
has no initialization, but each step requires the solution of a constrained convex optimization 
problem (whose time complexity was not quantified by the authors). Further comments on the time 
complexity of our algorithm are given in Section 7. 



6 The Robustness of WTA to Label Perturbation 

In this section we show that WTA is tolerant to noise, i.e., the number of mistakes made by WTA on 
most labeled graphs (G, y) does not significantly change if a small number of labels are perturbed 
before running the algorithm. This is especially the case if the input graph G is polynomially 
connected (see Section 5 for a definition). 

As in previous sections, we start off from the case when the input graph is a tree, and then we 
extend the result to general graphs using random spanning trees. 

Suppose that the labels y in the tree (T, y) used as input to the algorithm have actually been 
obtained from another labeling y' of T through the perturbation (flipping) of some of its labels. 
As explained at the beginning of Section 4, WTA operates on a line graph L obtained through the 
linearization process of the input tree T. The following theorem shows that, whereas the cutsize 
differences $|\Phi^W_T(y) - \Phi^W_T(y')|$ and $|\Phi_T(y) - \Phi_T(y')|$ on tree T can in principle be very large, the 
cutsize differences $|\Phi^W_L(y) - \Phi^W_L(y')|$ and $|\Phi_L(y) - \Phi_L(y')|$ on the line graph L built by WTA are 
always small. 

In order to quantify the above differences, we need a couple of ancillary definitions. Given a 
labeled tree (T, y), define $\zeta_T(K)$ to be the sum of the weights of the K heaviest edges in T, 

$$\zeta_T(K) = \max_{E' \subseteq E \,:\, |E'| = K}\ \sum_{(i,j) \in E'} w_{i,j}.$$

If T is unweighted we clearly have $\zeta_T(K) = K$. Moreover, given any two labelings y and y' 
of T's nodes, we let $\delta(y, y')$ be the number of nodes for which the two labelings differ, i.e., 
$\delta(y, y') = \bigl|\{i = 1, \dots, n \,:\, y_i \neq y'_i\}\bigr|$. 

Theorem 6. On any given labeled tree (T, y) the tree linearization step of WTA generates a line 
graph L such that: 

1. $\Phi^W_L(y) \le \min_{y' \in \{-1,+1\}^n} 2\bigl(\Phi^W_T(y') + \zeta_T(\delta(y, y'))\bigr)$; 

2. $\Phi_L(y) \le \min_{y' \in \{-1,+1\}^n} 2\bigl(\Phi_T(y') + \delta(y, y')\bigr)$. 
In order to highlight the consequences of WTA's linearization step contained in Theorem 6, 
consider as a simple example an unweighted star graph (T, y) where all labels are +1 except for 
the central node c whose label is -1. We have $\Phi_T(y) = n - 1$, but flipping the sign of $y_c$ we would 
obtain the star graph (T, y') with $\Phi_T(y') = 0$. Using Theorem 6 (item 2) we get $\Phi_L(y) \le 2$. 
Hence, on this star graph WTA's linearization step generates a line graph with a constant number 
of $\phi$-edges even if the input tree T has no $\phi$-free edges. Because by flipping the labels of a few nodes 
(in this case the label of c) we obtain a tree with a much more regular labeling, the labels of those 
nodes can naturally be seen as corrupted by noise. 

The following theorem quantifies to what extent the mistake bound of WTA on trees can take 
advantage of the tolerance to label perturbation contained in Theorem 6. Introducing shorthands 
for the right-hand side expressions in Theorem 6, 

$$\tilde\Phi^W_T(y) = \min_{y' \in \{-1,+1\}^n} 2\bigl(\Phi^W_T(y') + \zeta_T(\delta(y, y'))\bigr)$$

and 

$$\tilde\Phi_T(y) = \min_{y' \in \{-1,+1\}^n} 2\bigl(\Phi_T(y') + \delta(y, y')\bigr),$$

we have the following robust version of Theorem 3. 

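Theorem 7. If WTA is run on a weighted and labeled tree (T, y), then the total number $m_T$ of 
mistakes satisfies 

$$m_T \stackrel{\mathcal{O}}{=} \tilde\Phi_T(y) \left(1 + \log\left(1 + \frac{R^W_T(\neg E')\, \tilde\Phi^W_T(y)}{\tilde\Phi_T(y)}\right)\right) + |E'|$$

for all subsets $E'$ of $E \setminus E^\phi$. 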

As a simple consequence, we have the following corollary. 

Corollary 8. If WTA is run on a weighted and polynomially connected labeled tree (T, y), then 
the total number $m_T$ of mistakes satisfies 

$$m_T \stackrel{\mathcal{O}}{=} \tilde\Phi_T(y) \log n.$$

Theorem 7 combines the result of Theorem 3 with the robustness to label perturbation of WTA's 
tree linearization procedure. Comparing the two theorems, we see that the main advantage of the 
tree linearization lies in the mistake bound dependence on the logarithmic factors occurring in the 
formulas: Theorem 7 shows that, when $\tilde\Phi_T(y) \ll \Phi_T(y)$, the performance of WTA can be just 
linear in $\tilde\Phi_T(y)$. Theorem 3 shows instead that the dependence on $\Phi_T(y)$ is in general superlinear 
even in cases when flipping few labels of y makes the cutsize $\Phi_T(y)$ decrease in a substantial 
way. In many cases, the tolerance to noise allows us to achieve even better results: Corollary 8 
states that, if T is polynomially connected and there exists a labeling y' with small $\delta(y, y')$ such 
that $\Phi_T(y')$ is much smaller than $\Phi_T(y)$, then the performance of WTA is about the same as if 
the algorithm were run on (T, y'). In fact, from Lemma 2 we know that when T is polynomially 
connected the mistake bound of WTA mainly depends on the number of $\phi$-edges in (L, y), which 
can often be much smaller than those in (T, y). As a simple example, let T be an unweighted star 
graph with a labeling y and z be the difference between the number of +1 and the number of -1 
in y. Then the mistake bound of WTA is linear in $z \log n$ irrespective of $\Phi_T(y)$ and, specifically, 
irrespective of the label assigned to the central node of the star, which can greatly affect the actual 
value of $\Phi_T(y)$. 

We are now ready to extend the above results to the case when WTA operates on a general 
weighted graph (G, y) via a uniformly generated random spanning tree T. As before, we need 
some shorthand notation. Define 

$$\tilde\Phi_G(y) = \min_{y' \in \{-1,+1\}^n} 2\bigl(\mathbb{E}\bigl[\Phi_T(y')\bigr] + \delta(y, y')\bigr)$$

where the expectation is over the random draw of a spanning tree T of G. The following are the 
robust versions of Theorem 4 and Corollary 5. 

Theorem 9. If WTA is run on a random spanning tree T of a labeled weighted graph (G, y), then 
the total number $m_G$ of mistakes satisfies 

E = ^UV) (l + log (l + <axE [R'^] ) ) + E 
where $w^\phi_{\max} = \max_{(i,j) \in E^\phi} w_{i,j}$. 

Corollary 10. If WTA is run on a random spanning tree T of a labeled weighted graph (G, y) and 
the ratio of the weights of each pair of edges of G is polynomial in n, then the total number $m_G$ of 
mistakes satisfies 

$$\mathbb{E}\, m_G \stackrel{\mathcal{O}}{=} \tilde\Phi_G(y) \log n.$$

The relationship between Theorem 9 and Theorem 4 is similar to the one between Theorem 7 
and Theorem 3. When there exists a labeling y' such that $\delta(y, y')$ is small and 
$\mathbb{E}[\Phi_T(y')] \ll \mathbb{E}[\Phi_T(y)]$, then Theorem 9 allows a linear dependence on $\mathbb{E}[\Phi_T(y)]$. Finally, Corollary 10 
quantifies the advantages of WTA's noise tolerance under a similar (but stricter) assumption as the one 
contained in Corollary 5. 



7 Implementation 

As explained in Section 4, WTA runs in two phases: (i) a random spanning tree is drawn; (ii) the 
tree is linearized and labels are sequentially predicted. As discussed in Subsection 1.1, Wilson's 
algorithm can draw a random spanning tree of "most" unweighted graphs in expected time $\mathcal{O}(n)$. 
The analysis of running times on weighted graphs is significantly more complex, and outside the 
scope of this paper. A naive implementation of WTA's second phase runs in time $\mathcal{O}(n \log n)$ and 
requires linear memory space when operating on a tree with n nodes. We now describe how to 
implement the second phase to run in time $\mathcal{O}(n)$, i.e., in constant amortized time per prediction 
step. 
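For the first phase, a loop-erased random walk in the style of Wilson's algorithm can be sketched as follows. This is an illustrative sketch rather than the implementation used in the paper: it assumes a connected graph stored as a symmetric adjacency dictionary, and it assumes the standard weighted variant in which the walk moves to a neighbor with probability proportional to the edge weight. The function name and the `rng` argument are hypothetical.

```python
import random


def wilson_spanning_tree(adj, rng=random):
    """Draw a random spanning tree by loop-erased random walks.

    adj: dict node -> dict neighbor -> positive weight (connected, symmetric).
    Returns a list of (child, parent) tree edges.
    """
    nodes = list(adj)
    root = rng.choice(nodes)
    in_tree = {root}
    parent = {}
    for start in nodes:
        # Walk until the current tree is hit. Overwriting parent[u] on each
        # visit implicitly erases the loops of the walk.
        u = start
        while u not in in_tree:
            nbrs = list(adj[u])
            parent[u] = rng.choices(nbrs, weights=[adj[u][v] for v in nbrs])[0]
            u = parent[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            u = parent[u]
    return [(u, parent[u]) for u in nodes if u != root]


adj = {
    0: {1: 1.0, 2: 2.0},
    1: {0: 1.0, 2: 1.0, 3: 3.0},
    2: {0: 2.0, 1: 1.0, 3: 1.0},
    3: {1: 3.0, 2: 1.0},
}
tree = wilson_spanning_tree(adj, random.Random(0))
```

Setting all weights to 1 gives the unweighted variant, which corresponds to drawing the tree while ignoring edge weights.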

Once the given tree T is linearized into an n-node line L, we initially traverse L from left 
to right. Call $j_0$ the left-most terminal node of L. During this traversal, the resistance distance 
$d(j_0, i)$ is incrementally computed for each node i in L. This makes it possible to calculate $d(i, j)$ 
in constant time for any pair of nodes, since $d(i, j) = |d(j_0, i) - d(j_0, j)|$ for all $i, j \in L$. On top 
of L, a complete binary tree T' with $2^{\lceil \log_2 n \rceil}$ leaves is constructed. The k-th leftmost leaf (in the 
usual tree representation) of T' is the k-th node in L (numbering the nodes of L from left to right). 
The algorithm maintains this data-structure in such a way that at time t: (i) the subsequence of 
leaves whose labels are revealed at time t are connected through a (bidirectional) list B, and (ii) all 
the ancestors in T' of the leaves of B are marked. See Figure 4. 

Figure 4: Constant amortized time implementation of WTA. The line L has n = 27 nodes (the 
adjacent squares at the bottom). Shaded squares are the revealed nodes, connected through a dark 
grey doubly-linked list B. The depicted tree T' has both unmarked (white) and marked (shaded) 
nodes. The arrows indicate the traversal operations performed by WTA when predicting the label 
of node $i_t$. The upward traversal stops as soon as a marked ancestor $\mathrm{anc}(i_t)$ is found, and then a 
downward traversal begins. Note that WTA first descends to the left, and then keeps going right all 
the way down. Once i' is determined, a single step within B suffices to determine i''. 

When WTA is required to predict the label $y_{i_t}$, the algorithm looks for the two closest revealed 
leaves i' and i'' oppositely located in L with respect to $i_t$. The above data structure supports this 
operation as follows. WTA starts from $i_t$ and goes upwards in T' until the first marked ancestor 
$\mathrm{anc}(i_t)$ of $i_t$ is reached. During this upward traversal, the algorithm marks each internal node of T' 
on the path connecting $i_t$ to $\mathrm{anc}(i_t)$. Then, WTA starts from $\mathrm{anc}(i_t)$ and goes downwards in order to 
find the leaf $i' \in B$ closest to $i_t$. Note how the algorithm uses node marks for finding its way down: 
For instance, in Figure 4 the algorithm goes left since $\mathrm{anc}(i_t)$ was reached from below through the 
right child node, and then keeps right all the way down to i'. Node i'' (if present) is then identified 
via the links in B. The two distances $d(i_t, i')$ and $d(i_t, i'')$ are compared, and the closest node to $i_t$ 
within B is then determined. Finally, WTA updates the links of B by inserting $i_t$ between i' and i''. 

In order to quantify the amortized time per trial, the key observation is that each internal node 
k of T' gets visited only twice during upward traversals over the n trials: The first visit takes place 
when k gets marked for the first time, the second visit of k occurs when a subsequent upward 
visit also marks the other (unmarked) child of k. Once both of k's children are marked, we are 
guaranteed that no further upward visits to k will be performed. Since the preprocessing operations 
take $\mathcal{O}(n)$, this shows that the total running time over the n trials is linear in n, as anticipated. Note, 
however, that the worst-case time per trial is $\mathcal{O}(\log n)$. For instance, on the very first trial T' has to 
be traversed all the way up and down. 

Footnote on the construction of T': For simplicity, this description assumes n is a power of 2. If this is not the 
case, we could add dummy nodes to L before building T'. 

This is the way we implemented WTA in the experiments described in the next section. 
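The marked binary tree and the list B described in this section can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the tree T' is stored implicitly in an array (node v has children 2v and 2v+1, an assumption of this sketch), line positions are 0-based, and the class and method names are hypothetical. The method returns the closest revealed positions to the left and to the right of the queried position, after which the two resistance distances can be compared in constant time.

```python
class RevealedNeighbors:
    """Marked complete binary tree over line positions plus a linked list B."""

    def __init__(self, n):
        size = 1
        while size < n:  # pad with dummy leaves up to a power of two
            size *= 2
        self.size = size
        self.marked = [False] * (2 * size)  # heap layout: children of v are 2v, 2v+1
        self.prev = {}  # links of the doubly-linked list B
        self.next = {}

    def insert(self, pos):
        """Return the (left, right) revealed neighbors of pos, then reveal pos."""
        node = self.size + pos
        if self.marked[node]:
            raise ValueError("position already revealed")
        self.marked[node] = True
        # Upward traversal: mark nodes until the parent is already marked.
        while node > 1 and not self.marked[node // 2]:
            node //= 2
            self.marked[node] = True
        left = right = None
        if node > 1:  # a marked ancestor exists, so its other child is marked
            came_from_right = node % 2 == 1
            node ^= 1  # switch to the sibling subtree
            while node < self.size:
                # Stay as close as possible to pos: prefer the right child when
                # searching on the left of pos, the left child otherwise.
                near, far = (2 * node + 1, 2 * node) if came_from_right else (2 * node, 2 * node + 1)
                node = near if self.marked[near] else far
            found = node - self.size
            if came_from_right:
                left, right = found, self.next.get(found)
            else:
                left, right = self.prev.get(found), found
        # Splice pos into B between its two neighbors.
        self.prev[pos], self.next[pos] = left, right
        if left is not None:
            self.next[left] = pos
        if right is not None:
            self.prev[right] = pos
        return left, right


rn = RevealedNeighbors(8)
history = [rn.insert(p) for p in (3, 6, 0, 4, 5)]
```

The downward traversal has the same length as the preceding upward traversal, so it does not change the amortized accounting given above.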



8 Experiments 

We now present the results of an experimental comparison on a number of real-world weighted 
graphs from different domains: text categorization, optical character recognition, spam detection 
and bioinformatics. Although our theoretical analysis is for the sequential prediction model, all 
experiments are carried out using a more standard train-test scenario. This makes it easy to compare 
WTA against popular nonsequential baselines, such as Label Propagation. 

We compare our algorithm to the following other methods, intended as representatives of two 
different ways of coping with the graph prediction problem: global vs. local prediction. 

Perceptron with Laplacian kernel. Introduced by [16] and here abbreviated as GPA (graph 
Perceptron algorithm). This algorithm sequentially predicts the nodes of a weighted graph $G = 
(V, E)$ after mapping V via the linear kernel based on $L_G^+ + \mathbf{1}\mathbf{1}^\top$, where $L_G$ is the Laplacian matrix 
of G. Following [19], we run GPA on a spanning tree T of the original graph. This is because a 
careful computation of the Laplacian pseudoinverse of an n-node tree takes time $\mathcal{O}(n + m^2 + mD)$, 
where m is the number of training examples plus the number of test examples (labels to predict), 
and D is the tree diameter (see the work of [19] for a proof of this fact). However, in most of our 
experiments m = n, implying a running time of $\Theta(n^2)$ for GPA. 

Note that GPA is a global approach, in that the graph topology affects, via the inverse Laplacian, 
the prediction on all nodes. 

Weighted Majority Vote. Introduced here and abbreviated as WMV. Since the common 
underlying assumption to graph prediction algorithms is that adjacent nodes are labeled similarly, a very 
intuitive and fast algorithm for predicting the label of a node i is via a weighted majority vote on 
the available labels of the adjacent nodes. More precisely, WMV predicts using the sign of 

$$\sum_{j \,:\, (i,j) \in E} y_j\, w_{i,j}$$

where $y_j = 0$ if node j is not available in the training set. The overall time and space requirements 
are both of order $\Theta(|E|)$, since we need to read (at least once) the weights of all edges. WMV is 
also a local approach, in the sense that prediction at each node is only affected by the labels of 
adjacent nodes. 
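The WMV rule is simple enough to state in a few lines. The sketch below assumes an adjacency dictionary and a dictionary of training labels; breaking ties (a zero score) in favor of +1 is an assumption of this sketch, and the function name is hypothetical.

```python
def wmv_predict(adj, train_labels, node):
    """Weighted majority vote over the labeled neighbors of `node`.

    adj:          dict node -> dict neighbor -> weight.
    train_labels: dict node -> label in {-1, +1}; missing nodes count as 0.
    """
    score = sum(w * train_labels.get(j, 0) for j, w in adj[node].items())
    return 1 if score >= 0 else -1  # assumption: ties are resolved as +1


adj = {0: {1: 2.0, 2: 1.0, 3: 0.5}}
prediction = wmv_predict(adj, {1: -1, 2: 1}, 0)
```

Here the heavier edge to node 1 outweighs the edge to node 2, and the unlabeled node 3 contributes nothing.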

Label Propagation. Introduced by [31] and here abbreviated as LABPROP. This is a batch 
transductive learning method based on solving a (possibly sparse) linear system of equations which 
requires $\Theta(mn)$ time on an n-node graph with m edges. This bad scalability prevented us from 
carrying out comparative experiments on larger graphs of 10^ or more nodes. Note that WMV can 
be viewed as a fast approximation of LABPROP. 

In our experiments, we combined WTA and GPA with spanning trees generated in different ways 
(note that WMV and LABPROP do not use spanning trees). 

Random Spanning Tree (RST). Each spanning tree is taken with probability proportional to the 
product of its edge weights (see, e.g., [12, Chapter 4]). In addition, we also tested WTA combined 
with RST generated by ignoring the edge weights (which were then restored before running WTA). 
This second approach gives a prediction algorithm whose total expected running time, including 
the generation of the spanning tree, is $\mathcal{O}(n)$ on most graphs. We abbreviate this spanning tree as 
NWRST (non-weighted RST). 

Depth-first spanning tree (DFST). This spanning tree is created via the following randomized 
depth-first visit: A root is selected at random, then each newly visited node is chosen with prob- 
ability proportional to the weights of the edges connecting the current vertex with the adjacent 
nodes that have not been visited yet. This spanning tree is faster to generate than RST, and can be 
viewed as an approximate version of RST. 
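The randomized depth-first visit can be sketched as follows, assuming a connected graph stored as an adjacency dictionary with positive weights. The function name and the `rng` argument are hypothetical, and the sketch recomputes the list of unvisited neighbors at every step for clarity rather than speed.

```python
import random


def random_dfs_tree(adj, rng=random):
    """Randomized depth-first spanning tree.

    adj: dict node -> dict neighbor -> positive weight (connected graph).
    Returns the list of (parent, child) tree edges in visiting order.
    """
    root = rng.choice(list(adj))
    visited = {root}
    stack = [root]
    tree = []
    while stack:
        u = stack[-1]
        fresh = [v for v in adj[u] if v not in visited]
        if not fresh:
            stack.pop()  # no unvisited neighbor left: backtrack
            continue
        # Choose the next node with probability proportional to the edge weight.
        v = rng.choices(fresh, weights=[adj[u][x] for x in fresh])[0]
        visited.add(v)
        tree.append((u, v))
        stack.append(v)
    return tree


adj = {
    0: {1: 1.0, 2: 2.0},
    1: {0: 1.0, 2: 1.0, 3: 3.0},
    2: {0: 2.0, 1: 1.0, 3: 1.0},
    3: {1: 3.0, 2: 1.0},
}
dfs_tree = random_dfs_tree(adj, random.Random(1))
```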

Minimum Spanning Tree (MST). The spanning tree minimizing the sum of the resistors of all 
edges. This is the tree whose Laplacian best approximates the Laplacian of G according to the 
trace norm criterion (see, e.g., the paper of [19]). 

Shortest Path Spanning Tree (SPST). [19] use the shortest path tree because it has a small 
diameter (at most twice the diameter of G). This allows them to better control the theoretical 
performance of GPA. We generated several shortest path spanning trees by choosing the root node 
at random, and then took the one with minimum diameter. 

In order to check whether the information carried by the edge weight has predictive value for 
a nearest neighbor rule like WTA, we also performed a test by ignoring the edge weights during 
both the generation of the spanning tree and the running of WTA's nearest neighbor rule. This is 
essentially the algorithm analyzed by [18], and we denote it by NWWTA (non-weighted WTA). We 
combined NWWTA with weighted and unweighted spanning trees. So, for instance, NWWTA+RST 
runs a 1-NN rule (NWWTA) that does not take edge weights into account (i.e., pretending that all 
weights are unitary) on a random spanning tree generated according to the actual edge weights. 
NWWTA+NWRST runs NWWTA on a random spanning tree that also disregards edge weights. 

Finally, in order to make the classifications based on RST's more robust with respect to the 
variance associated with the random generation of the spanning tree, we also tested committees 
of RST's. For example, K*WTA+RST denotes the classifier obtained by drawing K RST's, running 
WTA on each one of them, and then aggregating the predictions of the K resulting classifiers via a 
majority vote. For our experiments we chose K = 7, 11, 17. 
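The aggregation step of a committee amounts to a per-node majority vote. A minimal sketch, assuming each of the K classifiers returns a list of predictions in {-1, +1} over the same test nodes and that K is odd (so ties cannot occur); the function name is hypothetical.

```python
def committee_vote(predictions):
    """Majority vote over K prediction lists (one list per spanning tree)."""
    return [1 if sum(votes) > 0 else -1 for votes in zip(*predictions)]


combined = committee_vote([[1, -1, 1], [1, 1, -1], [-1, -1, -1]])
```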

We ran our experiments on five real-world datasets: 

RCV1. The first 10,000 documents (in chronological order) of Reuters Corpus Volume 1, with 
TF-IDF preprocessing and Euclidean normalization. Available at trec.nist.gov/data/reuters/reuters.html. 

USPS. The USPS dataset with features normalized into [0, 2]. 

KROGAN. This is a high-throughput protein-protein interaction network for budding yeast. It 
has been used by [21] and [23]. 

COMBINED. A second dataset from the work of [23]. It is a combination of three datasets: 
[[nil's, Il20l's, and flU's. 

WEBSPAM. A large dataset (110,900 nodes and 1,836,136 edges) of inter-host links created for 
the Web Spam Challenge 2008 [24]. This is a weighted graph with binary labels and a pre-defined 
train/test split: 3,897 training nodes and 1,993 test nodes (the remaining ones being unlabeled). 

We created graphs from RCV1 and USPS with as many nodes as the total number of examples 
$(x_i, y_i)$ in the datasets. That is, 10,000 nodes for RCV1 and 7291+2007 = 9298 for USPS. 
Following previous experimental settings [31, 2], the graphs were constructed using k-NN based on 
the standard Euclidean distance $\|x_i - x_j\|$ between node i and node j. The weight $w_{i,j}$ was set to 
$w_{i,j} = \exp\bigl(-\|x_i - x_j\|^2 / \sigma^2_{i,j}\bigr)$ if j is one of the k nearest neighbors of i, and 0 otherwise. To 
set $\sigma^2_{i,j}$, we first computed the average square distance between i and its k nearest neighbors (call 
it $\sigma^2_i$), then we computed $\sigma^2_j$ in the same way, and finally set $\sigma^2_{i,j} = (\sigma^2_i + \sigma^2_j)/2$. We generated 
two graphs for each dataset by running k-NN with k = 10 (RCV1-10 and USPS-10) and k = 100 
(RCV1-100 and USPS-100). The labels were set using the four most frequent categories in RCV1 
and all 10 categories in USPS. 
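The k-NN graph construction can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: an undirected edge is created whenever either endpoint lists the other among its k nearest neighbors, and the points are assumed distinct so that no bandwidth is zero. The brute-force neighbor search and all names are illustrative only.

```python
import math


def knn_graph(points, k):
    """Build a weighted k-NN graph with Gaussian weights and local bandwidths.

    points: list of equal-length numeric tuples (assumed distinct).
    Returns dict {(i, j): weight} with i < j.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    neighbors, sigma2 = [], []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))[:k]
        neighbors.append(others)
        # sigma_i^2: average squared distance from i to its nearest neighbors.
        sigma2.append(sum(sq_dist(points[i], points[j]) for j in others) / len(others))
    weights = {}
    for i in range(n):
        for j in neighbors[i]:
            s2 = (sigma2[i] + sigma2[j]) / 2  # sigma_{i,j}^2
            weights[(min(i, j), max(i, j))] = math.exp(-sq_dist(points[i], points[j]) / s2)
    return weights


graph = knn_graph([(0.0,), (1.0,), (3.0,)], k=1)
```

In the example, node 2 picks node 1 as its nearest neighbor even though node 1 does not pick node 2, so the edge (1, 2) exists only because of the symmetrization assumed here.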

In KROGAN and COMBINED we only considered the biggest connected components of both 
datasets, obtaining 2,169 nodes and 6,102 edges for KROGAN, and 2,871 nodes and 6,407 edges 
for COMBINED. In these graphs, each node belongs to one or more classes, each class 
representing a gene function. We selected the set of functional labels at depth one in the FunCat 
classification scheme of the MIPS database [25], resulting in seventeen classes per dataset. 

In order to associate binary classification tasks with the six non-binary datasets/graphs (RCV1-10, 
RCV1-100, USPS-10, USPS-100, KROGAN, COMBINED) we binarized the corresponding 
multiclass problems via a standard one-vs-rest scheme. We thus obtained: four binary classification 
tasks for RCV1-10 and RCV1-100, ten binary tasks for USPS-10 and USPS-100, seventeen binary 
tasks for both KROGAN and COMBINED. For a given binary task and dataset, we tried different 
proportions of training set and test set sizes. In particular, we used training sets of size 5%, 10%, 
25% and 50%. For any given size, the training sets were randomly selected. 

We report error rates and F-measures on the test set, after macro-averaging over the binary 
tasks. The results are contained in Tables 1-7 (Appendix B) and in Figures 5-6. Specifically, 
Tables 1-6 contain results for all combinations of algorithms and train/test split for the first six 
datasets (i.e., all but WEBSPAM). 

The WEBSPAM dataset is very large, and running experiments on this graph requires a lot of 
computational resources. Moreover, GPA has always shown lower accuracy 
than the corresponding version of WTA (i.e., the one using the same kind of spanning tree) on all 
other datasets. Hence we decided not to go on any further with the refined implementation of GPA 
on trees we mentioned above. In Table 7 we only report test error results on the four algorithms 
WTA, WMV, LABPROP, and WTA with a committee of seven (nonweighted) random spanning trees. 

Footnote (USPS): Available at www-i6.informatik.rwth-aachen.de/~keysers/usps.html. 

Footnote (WEBSPAM): The dataset is available at barcelona.research.yahoo.net/webspam/datasets/. We do not 
compare our results to those obtained in the challenge since we are only exploiting the graph (weighted) topology here, 
disregarding content features. 

In our experimental setup we tried to control the sources of variance in the first six datasets as 
follows: 

1. We first generated ten random permutations of the node indices for each one of the six 
graphs/datasets; 

2. on each permutation we generated the training/test splits; 

3. we computed MST and SPST for each graph and made (for WTA, GPA, WMV, and LABPROP) 
one run per permutation on each of the 4+4+10+10+17+17 = 62 binary problems, averaging 
results over permutations and splits; 

4. for each graph, we generated ten random instances for each one of RST, NWRST, DFST, 
and then operated as in step 2, with a further averaging over the randomness in the tree 
generation. 

Figure 5 extracts from Tables 1-6 the error levels of the best spanning tree performers, and 
compares them to WMV and LABPROP. For comparison purposes, we also display the error levels 
achieved by WTA operating on a committee of seventeen random spanning trees (see below). 
Figure 6 (left) contains the error level on WEBSPAM reported in Table 7. Finally, Figure 6 (right) is 
meant to emphasize the error rate differences between RST and NWRST run with WTA. 

Several interesting observations and conclusions can be drawn from our experiments. 

1. WTA outperforms GPA on all datasets and with all spanning tree combinations. In particular, 
though we only reported aggregated results, the same relative performance pattern among 
the two algorithms repeats systematically over all binary classification problems. In addition, 
WTA runs significantly faster than GPA, requires less memory storage (linear in n, rather than 
quadratic), and is also fairly easy to implement. 

2. By comparing NWWTA to WTA, we see that the edge weight information in the nearest neigh- 
bor rule increases accuracy, though only by a small amount. 

3. WMV is a fast and accurate approximation to LABPROP when either the graph is dense 
(RCV1-100 and USPS-100) or the training set is comparatively large (25%-50%), although 
neither of the two situations often occurs in real-world applications. 

4. The best performing spanning tree for both WTA and GPA is MST. This might be explained 
by the fact that MST tends to select light $\phi$-edges of the original graph. 

5. NWRST and DFST are fast approximations to RST. Though the use of NWRST and DFST 
does not provide theoretical performance guarantees as for RST, in our experiments they do 
actually perform comparably. Hence, in practice, NWRST and DFST might be viewed as fast 
and practical ways to generate spanning trees for WTA. 
Figure 5: Macroaveraged test error rates on the first six datasets as a function of the training set 
size. The results are extracted from Tables 1-6 in Appendix B. Only the best performing spanning 
tree (i.e., MST) is shown for the algorithms that use spanning trees. These results are compared to 
WMV, LABPROP, and 17*WTA+RST. 

Figure 6: Left: Error rate levels on WEBSPAM taken from Table 7 in Appendix B. Right: 
Average error rate difference across datasets when using WTA+NWRST rather than WTA+RST. 

6. The prediction performance of WTA+MST is sometimes slightly inferior to LABPROP's. 
However, it should be stressed that LABPROP takes time $\Theta(mn)$, where m is the number of edges, 
whereas a single sweep of WTA+MST over the graph just takes time $\mathcal{O}(m \log n)$. 
Committees of spanning trees are a simple way to make WTA approach, and sometimes surpass, the 
performance of LABPROP. One can see that on sparse graphs using committees gives a good 
performance improvement. In particular, committees of WTA can reach the same 
performance as LABPROP while adding just a constant factor to their (linear) time complexity. 



9 Conclusions and Open Questions 

We introduced and analyzed WTA, a randomized online prediction algorithm for weighted graph 
prediction. The algorithm uses random spanning trees and has nearly optimal performance guaran- 
tees in terms of expected prediction accuracy. The expected running time of WTA is optimal when 
the random spanning tree is drawn ignoring edge weights. Thanks to its linearization phase, the 
algorithm is also provably robust to label noise. 
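
For concreteness, one standard way to draw a spanning tree uniformly at random while ignoring edge weights is Wilson's loop-erased random walk algorithm. The sketch below is illustrative only and is not claimed to be the implementation used in the experiments. The adjacency-list input format and the function name are assumptions, and the graph is assumed to be connected.

```python
import random


def wilson_uniform_spanning_tree(adj, seed=None):
    """Sample a uniform spanning tree of a connected unweighted graph.

    `adj` maps each node to the list of its neighbours.
    Returns a sorted list of tree edges (u, v) with u < v.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    in_tree = {nodes[0]}   # arbitrary root
    parent = {}
    for start in nodes:
        # Random walk from `start` until the current tree is hit. Keeping
        # only the last exit from each node erases loops implicitly.
        last_exit = {}
        u = start
        while u not in in_tree:
            last_exit[u] = rng.choice(adj[u])
            u = last_exit[u]
        # Retrace the loop-erased path and attach it to the tree.
        u = start
        while u not in in_tree:
            in_tree.add(u)
            parent[u] = last_exit[u]
            u = last_exit[u]
    return sorted((min(u, v), max(u, v)) for u, v in parent.items())


# A 4-cycle with one chord; any output has exactly 3 edges.
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(wilson_uniform_spanning_tree(graph, seed=0))
```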

Our experimental evaluation shows that WTA outperforms other previously proposed online 
predictors. Moreover, when combined with an aggregation of random spanning trees, WTA also 
tends to beat standard batch predictors, such as label propagation. These features make WTA (and 
its combinations) suitable to large scale applications. 

There are two main directions in which this work can be improved. First, previous analyses fj\ 
reveal that WTA's analysis is loose, at least when the input graph is an unweighted tree with small 
diameter. This is the main source of the $\Omega(\ln |V|)$ slack between the WTA upper bound and the 
general lower bound of Theorem 1. So we ask whether, at least in certain cases, this slack could be 
reduced. Second, in our analysis we express our upper and lower bounds in terms of the cutsize. One 
may object that a more natural quantity for our setting is the weighted cutsize, as this better 
reflects the assumption that $\phi$-edges tend to be light, a natural notion of bias for weighted 
graphs. In more generality, we ask what are other criteria that make a notion of bias better than 
another one. For example, we may prefer a bias which is robust to small perturbations of the problem 
instance. In this sense $\Phi^*_G$, the cutsize robust to label perturbation introduced in Section 6, 
is a better bias than $\mathbb{E}\,\Phi_T$. We thus ask whether there is a notion of bias, more natural 
and robust than $\mathbb{E}\,\Phi_T$, which captures as tightly as possible the optimal number of online 
mistakes on general weighted graphs. A partial answer to this question is provided by the recent work 
of [29]. It would also be nice to tie this machinery with recent results in the active node 
classification setting on trees contained in flU. 

The MST of a graph $G = (V, E)$ can be computed in time $O(|E| \log |V|)$. Slightly faster 
implementations do actually exist which rely on Fibonacci heaps. 

Acknowledgments This work was supported in part by Google Inc. through a Google Research 
Award, and by the PASCAL2 Network of Excellence under EC grant 216886. This publication 
only reflects the authors' views. 



Appendix A 



This appendix contains the proofs of Lemma 2, Theorem 3, Theorem 4, Corollary 5, Theorem 6, 
Theorem 7, Corollary 8, Theorem 9, and Corollary 10. Notation and references are as in the main 
text. We start by proving Lemma 2. 

Proof of Lemma 2. Let a cluster be any maximal sub-line of $L$ whose edges are all $\phi$-free. Then 
$L$ contains exactly $\Phi_L(y) + 1$ clusters, which we number consecutively, starting from one of the 
two terminal nodes. Consider the $k$-th cluster $c_k$. Let $v_0$ be the first node of $c_k$ whose label 
is predicted by WTA. After $y_{v_0}$ is revealed, the cluster splits into two edge-disjoint sub-lines 
$c'_k$ and $c''_k$, both having $v_0$ as terminal node. (With no loss of generality, we assume that 
neither of the two sub-lines is empty, so that $v_0$ is not a terminal node of $c_k$.) Let $v'_k$ and 
$v''_k$ be the closest nodes to $v_0$ such that (i) $y_{v'_k} = y_{v''_k} \neq y_{v_0}$ and (ii) $v'_k$ is 
adjacent to a terminal node of $c'_k$, and $v''_k$ is adjacent to a terminal node of $c''_k$. The 
nearest neighbor prediction rule of WTA guarantees that the first mistake made on $c'_k$ 
(respectively, $c''_k$) must occur on a node $v_1$ such that $d(v_0, v_1) \ge d(v_1, v'_k)$ (respectively, 
$d(v_0, v_1) \ge d(v_1, v''_k)$). By iterating this argument for the subsequent mistakes we see that the 
total number of mistakes made on cluster $c_k$ is bounded by 

$$1 + \left\lfloor \log_2 \frac{R'_k + (w'_k)^{-1}}{(w'_k)^{-1}} \right\rfloor + \left\lfloor \log_2 \frac{R''_k + (w''_k)^{-1}}{(w''_k)^{-1}} \right\rfloor$$

where $R'_k$ is the resistance diameter of sub-line $c'_k$, and $w'_k$ is the weight of the $\phi$-edge 
between $v'_k$ and the terminal node of $c'_k$ closest to it ($R''_k$ and $w''_k$ are defined similarly). 
Hence, summing the above displayed expression over clusters $k = 1, \ldots, \Phi_L(y) + 1$ we obtain 

$$m_L \stackrel{O}{=} \Phi_L(y) + \sum_k \Big( \log\big(1 + R'_k w'_k\big) + \log\big(1 + R''_k w''_k\big) \Big)$$

$$\stackrel{O}{=} \Phi_L(y) \left( 1 + \log\Big(1 + \frac{1}{\Phi_L(y)} \sum_k R'_k w'_k\Big) + \log\Big(1 + \frac{1}{\Phi_L(y)} \sum_k R''_k w''_k\Big) \right)$$

$$\stackrel{O}{=} \Phi_L(y) \left( 1 + \log\Big(1 + \frac{R^W_L\, \Phi^W_L(y)}{\Phi_L(y)}\Big) \right)$$

where in the second step we used Jensen's inequality and in the last one the fact that 
$\sum_k (R'_k + R''_k) = R^W_L$ and $\max_k w'_k \stackrel{O}{=} \Phi^W_L(y)$, $\max_k w''_k \stackrel{O}{=} \Phi^W_L(y)$. 
This proves the lemma in the case $E' = \emptyset$. 

In order to conclude the proof, observe that if we take any semi-cluster $c'_k$ (obtained, as before, 
by splitting cluster $c_k$, being $v_0 \in c_k$ the first node whose label is predicted by WTA), and 
pretend to split it into two sub-clusters connected by a $\phi$-free edge, we could repeat the previous 
dichotomic argument almost verbatim on the two sub-clusters at the cost of adding an extra mistake. We 
now make this intuitive argument more precise. Let $(i, j)$ be a $\phi$-free edge belonging to 
semi-cluster $c'_k$, and suppose without loss of generality that $i$ is closer to $v_0$ than $j$ is. If 
we remove edge $(i, j)$ then $c'_k$ splits into two subclusters: $c'_k(v_0)$ and $c'_k(j)$, containing 
node $v_0$ and $j$, respectively (see Figure 7). Let $m_{c'_k}$, $m_{c'_k(v_0)}$ and $m_{c'_k(j)}$ be the 
number of mistakes made on $c'_k$, $c'_k(v_0)$ and $c'_k(j)$, respectively. We clearly have 
$m_{c'_k} = m_{c'_k(v_0)} + m_{c'_k(j)}$. 




Figure 7: We illustrate the way we bound the number of mistakes on semi-cluster $c'_k$ by dropping 
the resistance contribution of any (possibly very light) edge $(i, j)$ at the cost of increasing the 
mistake bound on $c'_k$ by 1. The removal of $(i, j)$ makes $c'_k$ split into subclusters $c'_k(v_0)$ and 
$c'_k(j)$. We can then drop edge $(i, j)$ by making node $i$ coincide with node $j$. The resulting 
semi-cluster is denoted $\gamma'_k$. This shortened version of $c'_k$ can be viewed as split into 
sub-cluster $\gamma'_k(v_0)$ and subcluster $\gamma'_k(j)$, corresponding to $c'_k(v_0)$ and $c'_k(j)$, 
respectively. Now, the number of mistakes made on $c'_k(v_0)$ and $c'_k(j)$ can be bounded by those made 
on $\gamma'_k(v_0)$ and $\gamma'_k(j)$. Hence, we can bound the mistakes on $c'_k$ through the ones made 
on $\gamma'_k$, with the addition of a single mistake, rather than two, due to the double node 
$i \equiv j$ of $\gamma'_k$. 

Let now $\gamma'_k$ be the semi-cluster obtained from $c'_k$ by contracting edge $(i, j)$ so as to make 
$i$ coincide with $j$ (we sometimes write $i \equiv j$). Cluster $\gamma'_k$ can be split into two parts 
which overlap only at node $i \equiv j$: $\gamma'_k(v_0)$, with terminal nodes $v_0$ and $i$ (coinciding 
with node $j$), and $\gamma'_k(j)$. In a similar fashion, let $m_{\gamma'_k}$, $m_{\gamma'_k(v_0)}$, and 
$m_{\gamma'_k(j)}$ be the number of mistakes made on $\gamma'_k$, $\gamma'_k(v_0)$ and $\gamma'_k(j)$, 
respectively. We have $m_{\gamma'_k} = m_{\gamma'_k(v_0)} + m_{\gamma'_k(j)} - 1$, where the $-1$ takes 
into account that $\gamma'_k(v_0)$ and $\gamma'_k(j)$ overlap at node $i \equiv j$. 

Observing now that, for each node $v$ belonging to $c'_k(v_0)$ (and $\gamma'_k(v_0)$), the distance 
$d(v, v'_k)$ is smaller on $\gamma'_k$ than on $c'_k$, we can apply the abovementioned dichotomic 
argument to bound the mistakes made on $c'_k$, obtaining $m_{c'_k(v_0)} \le m_{\gamma'_k(v_0)}$. Since 
$m_{c'_k(j)} = m_{\gamma'_k(j)}$, we can finally write 
$m_{c'_k} = m_{c'_k(v_0)} + m_{c'_k(j)} \le m_{\gamma'_k(v_0)} + m_{\gamma'_k(j)} = m_{\gamma'_k} + 1$. 
Iterating this argument for all edges in $E'$ concludes the proof. □ 
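
The nearest neighbor rule analyzed in this proof can be simulated on a small line graph. The sketch below is a hedged illustration: the function name, the default prediction used before any label is revealed, and the tie-breaking rule are assumptions that the proof does not fix. Distances are resistance distances, that is, sums of inverse edge weights along the line.

```python
def nearest_neighbor_mistakes(weights, labels, order, default=1):
    """Count mistakes of nearest-neighbor prediction on a weighted line.

    weights[i] is the weight of edge (i, i + 1); `order` is the sequence
    in which nodes are presented to the predictor.
    """
    # position[i] = resistance distance between node 0 and node i
    position = [0.0]
    for w in weights:
        position.append(position[-1] + 1.0 / w)

    revealed = {}   # node -> true label, for nodes already observed
    mistakes = 0
    for v in order:
        if revealed:
            # Ties are broken toward the smaller node index (an assumption).
            nearest = min(revealed, key=lambda u: (abs(position[u] - position[v]), u))
            prediction = revealed[nearest]
        else:
            prediction = default   # no label seen yet (an assumption)
        if prediction != labels[v]:
            mistakes += 1
        revealed[v] = labels[v]
    return mistakes


# A unit-weight line with a single phi-edge, hence two clusters.
labels = [1, 1, 1, -1, -1, -1]
print(nearest_neighbor_mistakes([1, 1, 1, 1, 1], labels, order=[0, 5, 2, 3, 1, 4]))  # prints 2
```

In this run the two mistakes occur on nodes 5 and 3, both in the cluster whose label differs from the first revealed one.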

In view of proving Theorem 3, we now prove the following two lemmas. 

Lemma 11. Given any tree $T$, let $E(T)$ be the edge set of $T$, and let $E(L')$ and $E(L)$ be the edge 
sets of line graphs $L'$ and $L$ obtained via WTA's tree linearization of $T$. Then the following holds. 

1. There exists a partition $\mathcal{P}_{L'}$ of $E(L')$ in pairs and a bijective mapping 
$\mu_{L'} : \mathcal{P}_{L'} \to E(T)$ such that the weight of both edges in each pair 
$S' \in \mathcal{P}_{L'}$ is equal to the weight of the edge $\mu_{L'}(S')$. 

2. There exists a partition $\mathcal{P}_L$ of $E(L)$ in sets $S$ such that $|S| \le 2$, and there 
exists an injective mapping $\mu_L : \mathcal{P}_L \to E(T)$ such that the weight of the edges in each 
set $S$ is equal to the weight of the edge $\mu_L(S)$. 

Proof. We start by defining the bijective mapping $\mu_{L'} : \mathcal{P}_{L'} \to E(T)$. Since each edge 
$(i, j)$ of $T$ is traversed exactly twice in the depth-first visit that generates $L'$, once in a 
forward step and once in a backward step, we partition $E(L')$ in pairs $S'$ such that 
$\mu_{L'}(S') = (i, j)$ if and only if $S'$ contains the pair of distinct edges created in $L'$ by the 
two traversals of $(i, j)$. By construction, the edges in each pair $S'$ have weight equal to that of 
$\mu_{L'}(S')$. Moreover, this mapping is clearly bijective, since any edge of $L'$ is created by a 
single traversal of an edge in $T$. The second mapping $\mu_L : \mathcal{P}_L \to E(T)$ is created as 
follows. $\mathcal{P}_L$ is created from $\mathcal{P}_{L'}$ by removing from each 
$S' \in \mathcal{P}_{L'}$ the edges that are eliminated when $L'$ is transformed into $L$. Note that we 
have $|\mathcal{P}_L| \le |\mathcal{P}_{L'}|$ and for any $S \in \mathcal{P}_L$ there is a unique 
$S' \in \mathcal{P}_{L'}$ such that $S \subseteq S'$. Now, for each $S \in \mathcal{P}_L$ let 
$\mu_L(S) = \mu_{L'}(S')$, where $S'$ is such that $S \subseteq S'$. Since $\mu_{L'}$ is bijective, 
$\mu_L$ is injective. Moreover, since the edges in $S'$ have the same weight as the edge 
$\mu_{L'}(S')$, the same property holds for $\mu_L$. □ 
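
The pairing behind part 1 of Lemma 11 can be checked on a toy tree. The sketch below builds the edge sequence of $L'$ by a depth-first visit that backtracks to the root and verifies that every tree edge yields exactly two edges of $L'$ with the same weight. The adjacency format and function names are assumptions made for illustration. Duplicate copies of a node are represented by the node's original identifier.

```python
def linearize_with_duplicates(tree, root):
    """Return the edges of L' as (u, v, weight) triples in visiting order.

    `tree[u]` is a list of (neighbour, weight) pairs. Recursion is fine
    for small examples; a large tree would need an explicit stack.
    """
    edges = []

    def visit(u, parent):
        for v, w in tree[u]:
            if v != parent:
                edges.append((u, v, w))   # forward step
                visit(v, u)
                edges.append((v, u, w))   # backward step

    visit(root, None)
    return edges


def pair_by_tree_edge(l_prime_edges):
    """Group the weights of the edges of L' by the tree edge that created them."""
    pairs = {}
    for u, v, w in l_prime_edges:
        pairs.setdefault(frozenset((u, v)), []).append(w)
    return pairs


tree = {0: [(1, 1.0), (2, 2.0)], 1: [(0, 1.0)], 2: [(0, 2.0), (3, 0.5)], 3: [(2, 0.5)]}
l_prime = linearize_with_duplicates(tree, root=0)
pairs = pair_by_tree_edge(l_prime)
print(len(l_prime))                                                   # prints 6
print(all(len(ws) == 2 and ws[0] == ws[1] for ws in pairs.values()))  # prints True
```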

Lemma 12. Let $(T, y)$ be a labeled tree, let $(L, y)$ be the linearization of $T$, and let $L'$ be the 
line graph with duplicates (as described above). Then the following holds: 

1. $\Phi^W_L(y) \le \Phi^W_{L'}(y) \le 2\Phi^W_T(y)$; 

2. $\Phi_L(y) \le \Phi_{L'}(y) \le 2\Phi_T(y)$. 



Proof. From Lemma 11 (part 1) we know that $L'$ contains a duplicated edge for each edge of $T$. 
This immediately implies $\Phi_{L'}(y) \le 2\Phi_T(y)$ and $\Phi^W_{L'}(y) \le 2\Phi^W_T(y)$. 

To prove the remaining inequalities, note that from the description of WTA in Section 4 (step 3), 
we see that when $L'$ is transformed into $L$ the pair of edges $(j', j)$ and $(j, j'')$ of $L'$, which 
are incident to a duplicate node $j$, gets replaced in $L$ (together with $j$) by a single edge 
$(j', j'')$. Now each such edge $(j', j'')$ cannot be a $\phi$-edge in $L$ unless either $(j, j')$ or 
$(j, j'')$ is a $\phi$-edge in $L'$, and this establishes $\Phi_L(y) \le \Phi_{L'}(y)$. Finally, if 
$(j', j'')$ is a $\phi$-edge in $L$, then its weight is not larger than the weight of the associated 
$\phi$-edge in $L'$ (step 3 of WTA), and this establishes $\Phi^W_L(y) \le \Phi^W_{L'}(y)$. □ 
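
The inequalities of Lemma 12 can be verified numerically on a small instance. The sketch below starts from a hard-coded node sequence of $L'$, contracts duplicate nodes, and compares cutsizes. Two points are assumptions made here for illustration: the first occurrence of each node is the copy that is kept, and a merged edge receives the minimum of the weights it replaces. The proof above only requires that the merged weight is not larger than that of the associated $\phi$-edge.

```python
def cutsizes(edges, y):
    """Return (cutsize, weighted cutsize) of labeling y on (u, v, w) edges."""
    cut = [w for u, v, w in edges if y[u] != y[v]]
    return len(cut), sum(cut)


def path_edges(seq, weights):
    """Edges of a line graph given its node sequence and edge weights."""
    return list(zip(seq, seq[1:], weights))


def contract_duplicates(seq, weights):
    """Turn L' into L by dropping repeated nodes (first occurrence kept)."""
    kept, kept_w = [seq[0]], []
    seen = {seq[0]}
    pending = None   # lightest edge weight seen since the last kept node
    for node, w in zip(seq[1:], weights):
        pending = w if pending is None else min(pending, w)
        if node not in seen:
            seen.add(node)
            kept.append(node)
            kept_w.append(pending)
            pending = None
    return kept, kept_w   # edges leading to trailing duplicates are dropped


y = {0: 1, 1: -1, 2: 1, 3: -1}
tree_edges = [(0, 1, 1.0), (0, 2, 2.0), (2, 3, 0.5)]
seq_lp = [0, 1, 0, 2, 3, 2, 0]             # depth-first visit from node 0
w_lp = [1.0, 1.0, 2.0, 0.5, 0.5, 2.0]
seq_l, w_l = contract_duplicates(seq_lp, w_lp)

phi_t, phiw_t = cutsizes(tree_edges, y)
phi_lp, phiw_lp = cutsizes(path_edges(seq_lp, w_lp), y)
phi_l, phiw_l = cutsizes(path_edges(seq_l, w_l), y)
print(phi_l <= phi_lp <= 2 * phi_t, phiw_l <= phiw_lp <= 2 * phiw_t)  # prints True True
```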

For the sake of simplicity, we are assuming here that the depth-first visit of T terminates by backtracking over all 
nodes on the path between the last node visited in a forward step and the root. 
^'^ Item 2 in this lemma is essentially contained in the paper by ifTSl . 
Recall that, given a labeled graph $G = (V, E)$ and any $\phi$-free edge subset 
$E' \subseteq E \setminus E^\phi$, the quantity $R^W_{G \setminus E'}$ is the sum of the resistors of all 
$\phi$-free edges in $E \setminus (E^\phi \cup E')$. 

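Lemma 13. If WTA is run on a weighted line graph $(L, y)$ obtained through the linearization of a 
given labeled tree $(T, y)$ with edge set $E$, then the total number $m_T$ of mistakes satisfies 

$$m_T \stackrel{O}{=} \Phi_L(y) \left( 1 + \log\Big(1 + \frac{R^W_{T \setminus E'}\, \Phi^W_L(y)}{\Phi_L(y)}\Big) \right) + \Phi_T(y) + |E'|,$$

where $E'$ is an arbitrary subset of $E \setminus E^\phi$. 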



Proof. Lemma 11 (part 2) exhibits an injective mapping $\mu_L : \mathcal{P}_L \to E$, where 
$\mathcal{P}_L$ is a partition of the edge set $E(L)$ of $L$, such that every $S \in \mathcal{P}_L$ 
satisfies $|S| \le 2$. Hence, we have $|E'(L)| \le 2|E'|$, where $E'(L)$ is the union of the pre-images 
of edges in $E'$ according to $\mu_L$ (note that some edge in $E'$ might not have a pre-image in 
$E(L)$). By the same argument, we also establish $|E_0(L)| \le 2\Phi_T(y)$, where $E_0(L)$ is the set of 
$\phi$-free edges of $L$ that belong to elements $S$ of the partition $\mathcal{P}_L$ such that 
$\mu_L(S) \in E^\phi$. 

Since the edges of $L$ that are neither in $E_0(L)$ nor in $E'(L)$ are partitioned by $\mathcal{P}_L$ 
in edge sets having cardinality at most two, which in turn can be injectively mapped via $\mu_L$ to 
$E \setminus (E^\phi \cup E')$, we have 
$R^W_{L \setminus (E'(L) \cup E_0(L))} \le 2 R^W_{T \setminus E'}$. Finally, we use $|E'(L)| \le 2|E'|$ and 
$|E_0(L)| \le 2\Phi_T(y)$ (which we just established) and apply Lemma 2 with 
$E' \equiv E'(L) \cup E_0(L)$. This concludes the proof. □ 



Proof of Theorem 3. We use Lemma 12 to establish $\Phi_L(y) \le 2\Phi_T(y)$ and 
$\Phi^W_L(y) \le 2\Phi^W_T(y)$. We then conclude with an application of Lemma 13. □ 



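Lemma 14. If WTA is run on a weighted line graph $(L, y)$ obtained through the linearization of a 
random spanning tree $T$ of a labeled weighted graph $(G, y)$, then the total number $m_G$ of mistakes 
satisfies 

$$\mathbb{E}\, m_G \stackrel{O}{=} \mathbb{E}[\Phi_L(y)] \Big( 1 + \log\big(1 + w^\phi_{\max}\, \mathbb{E}[R^W_T]\big) \Big) + \mathbb{E}[\Phi_T(y)],$$

where $w^\phi_{\max} = \max_{(i,j) \in E^\phi} w_{i,j}$. 

Proof. Using Lemma 13 with $E' \equiv \emptyset$ we can write 

$$\mathbb{E}\, m_G \stackrel{O}{=} \mathbb{E}\left[ \Phi_L(y) \left( 1 + \log\Big(1 + \frac{R^W_T\, \Phi^W_L(y)}{\Phi_L(y)}\Big) \right) + \Phi_T(y) \right]$$

$$\stackrel{O}{=} \mathbb{E}\Big[ \Phi_L(y) \Big( 1 + \log\big(1 + R^W_T\, w^\phi_{\max}\big) \Big) + \Phi_T(y) \Big]$$

$$\stackrel{O}{=} \mathbb{E}[\Phi_L(y)] \Big( 1 + \log\big(1 + \mathbb{E}[R^W_T]\, w^\phi_{\max}\big) \Big) + \mathbb{E}[\Phi_T(y)],$$

where the second equality follows from the fact that $\Phi^W_L(y) \le \Phi_L(y)\, w^\phi_{\max}$, which in 
turn follows from Lemma 11, and the third one follows from Jensen's inequality. 

Proof of Theorem 4. We apply Lemma 14 and then Lemma 12 to get $\Phi_L(y) \le 2\Phi_T(y)$. □ 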



Proof of Corollary 5. Let $f > \mathrm{poly}(n)$ denote a function growing faster than any polynomial in 
$n$. Choose a polynomially connected graph $G$ and a labeling $y$. For the sake of contradiction, 
assume that WTA makes more than $O(\mathbb{E}[\Phi_T(y)] \log n)$ mistakes on $(G, y)$. Then Theorem 4 
implies $w^\phi_{\max}\, \mathbb{E}[R^W_T] > \mathrm{poly}(n)$. Since 
$\mathbb{E}[R^W_T] = \sum_{(i,j) \in E \setminus E^\phi} r^W_{i,j}$, we have that 
$w^\phi_{\max} \max_{(i,j) \in E \setminus E^\phi} r^W_{i,j} > \mathrm{poly}(n)$. Together with the assumption 
of polynomial connectivity for $G$, this implies $w^\phi_{\max}\, r^W_{i,j} > \mathrm{poly}(n)$ for all 
$\phi$-free edges $(i, j)$. By definition of effective resistance, $w_{i,j}\, r^W_{i,j} \le 1$ for all 
$(i, j) \in E$. This gives $w^\phi_{\max} / w_{i,j} > \mathrm{poly}(n)$ for all $\phi$-free edges $(i, j)$, 
which in turn implies 

$$\frac{w^\phi_{\max}}{\sum_{(i,j) \in E \setminus E^\phi} w_{i,j}} > \mathrm{poly}(n).$$

As this contradicts our hypothesis, the proof is concluded. □ 
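
The two facts about effective resistance used in this proof can be checked numerically. The sketch below computes $r^W_{i,j}$ exactly with rational arithmetic by inverting the graph Laplacian with one node grounded. It relies on the known fact that a random spanning tree, drawn with probability proportional to the product of its edge weights, contains edge $(i, j)$ with probability $w_{i,j}\, r^W_{i,j}$. The function names and the edge-list format are assumptions made for illustration, and the graph is assumed to be connected.

```python
from fractions import Fraction


def invert(mat):
    """Gauss-Jordan inverse of a nonsingular square matrix of Fractions."""
    n = len(mat)
    aug = [row[:] + [Fraction(int(i == j)) for j in range(n)] for i, row in enumerate(mat)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * z for x, z in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def effective_resistances(n, edges):
    """Map each edge (i, j) of a connected graph on nodes 0..n-1 to r_ij."""
    m = n - 1   # the last node is grounded: its row and column are removed
    lap = [[Fraction(0)] * m for _ in range(m)]
    for i, j, w in edges:
        w = Fraction(w)
        for a, b in ((i, j), (j, i)):
            if a < m:
                lap[a][a] += w
                if b < m:
                    lap[a][b] -= w
    inv = invert(lap)

    def g(a, b):
        return inv[a][b] if a < m and b < m else Fraction(0)

    return {(i, j): g(i, i) + g(j, j) - 2 * g(i, j) for i, j, _ in edges}


edges = [(0, 1, 1), (1, 2, 2), (0, 2, 3)]
r = effective_resistances(3, edges)
print(all(w * r[(i, j)] <= 1 for i, j, w in edges))   # prints True
print(sum(w * r[(i, j)] for i, j, w in edges))        # prints 2
```

The second printed value equals $n - 1$, the number of edges of any spanning tree, as expected when $w_{i,j}\, r^W_{i,j}$ is read as an edge inclusion probability.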

Proof of Theorem 6. We only prove the first part of the theorem. The proof of the second part 
corresponds to the special case when all weights are equal to 1. 

Let $\Delta(y, y') \subseteq V$ be the set of nodes $i$ such that $y_i \neq y'_i$. We therefore have 
$\delta(y, y') = |\Delta(y, y')|$. Since in a line graph each node is adjacent to at most two other 
nodes, the label flip of any node $j \in \Delta(y, y')$ can cause an increase of the weighted cutsize 
of $L$ by at most $w_{i',j} + w_{j,i''}$, where $i'$ and $i''$ are the two nodes adjacent to $j$ in $L$. 
Hence, flipping the labels of all nodes in $\Delta(y, y')$, we have that the total cutsize increase is 
bounded by the sum of the weights of the $2\delta(y, y')$ heaviest edges in $L$, which implies 

$$\Phi^W_L(y) \le \Phi^W_L(y') + \zeta_L\big(2\delta(y, y')\big).$$

By Lemma 12, $\Phi^W_L(y') \le 2\Phi^W_T(y')$. Moreover, Lemma 11 gives an injective mapping 
$\mu_L : \mathcal{P}_L \to E$ ($E$ is the edge set of $T$) such that the elements of $\mathcal{P}_L$ have 
cardinality at most two, and the weight of each edge $\mu_L(S)$ is the same as the weights of the 
edges in $S$. Hence, the total weight of the $2\delta(y, y')$ heaviest edges in $L$ is at most twice the 
total weight of the $\delta(y, y')$ heaviest edges in $T$. Therefore 
$\zeta_L\big(2\delta(y, y')\big) \le 2\zeta_T\big(\delta(y, y')\big)$. Hence, we have obtained 

$$\Phi^W_L(y) \le 2\Phi^W_T(y') + 2\zeta_T\big(\delta(y, y')\big),$$

concluding the proof. □ 
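
The first displayed inequality of this proof is easy to check on a concrete line graph. In the sketch below, `zeta(weights, k)` plays the role of $\zeta_L(k)$, the total weight of the $k$ heaviest edges of $L$. The function names and the toy weights are assumptions made for illustration.

```python
def weighted_cutsize(weights, y):
    """Weighted cutsize of labeling y on a line; weights[i] joins nodes i and i + 1."""
    return sum(w for w, a, b in zip(weights, y, y[1:]) if a != b)


def zeta(weights, k):
    """Total weight of the k heaviest edges."""
    return sum(sorted(weights, reverse=True)[:k])


weights = [3.0, 1.0, 2.0, 5.0]
y_prime = [+1, +1, +1, -1, -1]    # reference labeling y'
y = [+1, -1, +1, -1, +1]          # y' with the labels of nodes 1 and 4 flipped
delta = sum(a != b for a, b in zip(y, y_prime))

lhs = weighted_cutsize(weights, y)
rhs = weighted_cutsize(weights, y_prime) + zeta(weights, 2 * delta)
print(delta, lhs, rhs, lhs <= rhs)   # prints 2 11.0 13.0 True
```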



Proof of Theorem 7. We use Theorem 6 to bound $\Phi_L(y)$ and $\Phi^W_L(y)$ in the mistake bound of 
Lemma 13. □ 

Proof of Corollary 8. Recall that the resistance between two nodes $i$ and $j$ of any tree is simply 
the sum of the inverse weights over all edges on the path connecting the two nodes. Since $T$ is 
polynomially connected, we know that the ratio of any pair of edge weights is polynomial in $n$. This 
implies that $R^W_L\, \Phi^W_L(y)$ is polynomial in $n$, too. We apply Theorem 6 to bound $\Phi_L(y)$ in 
the mistake bound of Lemma 2 with $E' = \emptyset$. This concludes the proof. □ 

In the special case when $j$ is a terminal node we can set $w_{j,i''} = 0$. 

Lemma 15. If WTA is run on a line graph $L$ obtained by linearizing a random spanning tree $T$ of 
a labeled and weighted graph $(G, y)$, then we have 



Proof. Recall that Theorem 6 holds for any spanning tree $T$ of $G$. Thus it suffices to apply part 2 
of Theorem 6 and use $\mathbb{E}[\min X] \le \min \mathbb{E}[X]$. □ 
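
The inequality $\mathbb{E}[\min X] \le \min \mathbb{E}[X]$ invoked here follows in two lines. As a sketch, write $X_{y'}$ for a family of random variables indexed by $y'$ (this indexing is notation introduced only for this remark):

```latex
\min_{y''} X_{y''} \le X_{y'} \quad \text{for every } y'
\;\Longrightarrow\;
\mathbb{E}\Big[\min_{y''} X_{y''}\Big] \le \mathbb{E}\big[X_{y'}\big] \quad \text{for every } y'
\;\Longrightarrow\;
\mathbb{E}\Big[\min_{y''} X_{y''}\Big] \le \min_{y'} \mathbb{E}\big[X_{y'}\big].
```

The first implication is monotonicity of expectation, and the second holds because the left-hand side does not depend on $y'$.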



Proof of Theorem 9. We apply Lemma 15 to bound $\mathbb{E}[\Phi_L(y)]$ in Lemma 14. □ 

Proof of Corollary 10. Since the ratio of the weights of any pair of edges in $G$ is polynomial in 
$n$, the spanning tree $T$ must be polynomially connected. Thus we can use Corollary 8 and bound 
$\mathbb{E}[\Phi_L(y)]$ via Lemma 15. □ 



Appendix B 

This appendix summarizes all our experimental results. For each combination of dataset, algorithm, 
and train/test split, we provide macro-averaged error rates and F-measures on the test set. 
The algorithms are WTA, NWWTA, and GPA (all combined with various spanning trees), WMV, 
LABPROP, and WTA run with committees of random spanning trees. WEBSPAM was too large 
a dataset to perform as thorough an investigation. Hence we only report test error results on the 
four algorithms WTA, WMV, LABPROP, and WTA with a committee of 7 (nonweighted) random 
spanning trees. 
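
As a rough guide to how figures of this kind can be computed, the sketch below evaluates macro-averaged error rate and F-measure by treating each class as a one-vs-rest binary problem and averaging the per-class scores. This convention, and the function name, are assumptions made for illustration. The exact evaluation protocol behind the tables is the one described in the experimental section.

```python
def macro_error_and_f(y_true, y_pred, classes):
    """Macro-averaged one-vs-rest error rate and F-measure."""
    errors, f_scores = [], []
    for c in classes:
        t = [label == c for label in y_true]
        p = [label == c for label in y_pred]
        tp = sum(a and b for a, b in zip(t, p))
        fp = sum((not a) and b for a, b in zip(t, p))
        fn = sum(a and (not b) for a, b in zip(t, p))
        errors.append((fp + fn) / len(t))
        denom = 2 * tp + fp + fn
        # F = 2PR / (P + R) = 2TP / (2TP + FP + FN); set to 0 when undefined.
        f_scores.append(2 * tp / denom if denom else 0.0)
    return sum(errors) / len(classes), sum(f_scores) / len(classes)


err, f = macro_error_and_f(["a", "a", "b", "b"], ["a", "b", "b", "b"], classes=["a", "b"])
print(round(100 * err, 2), round(f, 2))   # prints 25.0 0.73
```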

Train/test split      5%            10%           25%           50%
Predictors          Error  F      Error  F      Error  F      Error  F

WTA+RST             25.54  0.81   22.67  0.84   19.06  0.86   16.57  0.88
WTA+NWRST           25.81  0.81   22.70  0.83   19.24  0.86   17.00  0.87
WTA+MST             21.09  0.84   17.94  0.87   13.93  0.90   11.40  0.91
WTA+SPST            25.47  0.81   22.65  0.83   19.31  0.86   17.24  0.87
WTA+DFST            26.02  0.81   22.34  0.84   17.73  0.87   14.89  0.89

NWWTA+RST           25.28  0.81   22.45  0.84   19.12  0.86   17.16  0.87
NWWTA+NWRST         25.97  0.81   23.14  0.83   19.54  0.86   17.84  0.87
NWWTA+MST           21.18  0.84   18.17  0.87   14.51  0.89   12.44  0.91
NWWTA+SPST          25.49  0.81   22.81  0.83   19.64  0.86   17.55  0.87
NWWTA+DFST          26.08  0.81   22.82  0.83   17.93  0.87   15.64  0.88

GPA+RST             32.75  0.75   29.85  0.78   27.67  0.80   24.44  0.82
GPA+NWRST           34.27  0.74   30.36  0.78   28.90  0.79   25.99  0.81
GPA+MST             27.98  0.79   24.89  0.82   21.80  0.84   20.27  0.85
GPA+SPST            27.18  0.79   25.13  0.82   22.20  0.84   20.27  0.85
GPA+DFST            47.11  0.61   45.65  0.64   43.08  0.66   38.20  0.71

7*WTA+RST           17.40  0.87   14.85  0.90   12.15  0.91   10.39  0.92
7*WTA+NWRST         17.81  0.87   15.15  0.89   12.51  0.91   10.92  0.92

11*WTA+RST          16.40  0.88   13.86  0.90   11.38  0.92    9.71  0.93
11*WTA+NWRST        16.78  0.88   14.22  0.90   11.73  0.92   10.20  0.93

17*WTA+RST          15.78  0.89   13.23  0.91   10.85  0.92    9.22  0.94
17*WTA+NWRST        16.07  0.89   13.55  0.90   11.18  0.92    9.65  0.93

WMV                 31.82  0.76   22.27  0.84   11.82  0.91    8.76  0.93

LABPROP             16.33  0.89   13.00  0.91   10.00  0.93    8.77  0.94

Table 1: RCV1-10 - Average error rate and F-measure on 4 classes. 

Train/test split      5%            10%           25%           50%
Predictors          Error  F      Error  F      Error  F      Error  F

WTA+RST             32.03  0.77   29.36  0.79   26.09  0.81   23.25  0.83
WTA+NWRST           32.05  0.77   29.89  0.78   26.65  0.80   23.82  0.83
WTA+MST             20.45  0.85   17.36  0.87   13.91  0.90   11.19  0.92
WTA+SPST            29.26  0.79   27.06  0.80   24.96  0.82   23.17  0.83
WTA+DFST            32.03  0.77   28.89  0.79   24.18  0.82   20.57  0.85

NWWTA+RST           31.72  0.77   29.46  0.78   26.20  0.81   24.04  0.82
NWWTA+NWRST         32.52  0.76   29.95  0.78   26.88  0.80   24.84  0.82
NWWTA+MST           20.54  0.85   17.68  0.87   14.37  0.89   12.25  0.91
NWWTA+SPST          29.28  0.79   27.13  0.80   25.16  0.82   23.72  0.83
NWWTA+DFST          32.05  0.77   28.81  0.79   24.14  0.82   21.28  0.84

GPA+RST             36.47  0.73   35.33  0.74   33.81  0.75   32.32  0.76
GPA+NWRST           38.26  0.72   35.91  0.73   35.20  0.74   32.73  0.76
GPA+MST             26.65  0.81   24.30  0.82   20.29  0.85   18.75  0.86
GPA+SPST            32.43  0.74   28.00  0.78   26.61  0.79   25.77  0.80
GPA+DFST            48.35  0.61   47.85  0.61   44.78  0.65   41.12  0.68

7*WTA+RST           23.30  0.84   20.55  0.86   16.87  0.88   14.34  0.90
7*WTA+NWRST         23.64  0.84   20.77  0.86   17.27  0.88   14.81  0.90

11*WTA+RST          22.06  0.85   19.39  0.87   15.63  0.89   13.20  0.91
11*WTA+NWRST        22.29  0.85   19.54  0.87   16.09  0.89   13.61  0.91

17*WTA+RST          21.33  0.86   18.62  0.88   14.91  0.90   12.39  0.92
17*WTA+NWRST        21.49  0.86   18.86  0.87   15.29  0.89   12.78  0.91

WMV                 12.48  0.91   10.50  0.93    9.49  0.93    8.96  0.94

LABPROP             24.39  0.85   20.78  0.87   14.45  0.91   10.73  0.93

Table 2: RCV1-100 - Average error rate and F-measure on 4 classes. 

Train/test split      5%            10%           25%           50%
Predictors          Error  F      Error  F      Error  F      Error  F

WTA+RST              5.32  0.97    4.28  0.98    3.08  0.98    2.36  0.99
WTA+NWRST            5.65  0.97    4.51  0.97    3.29  0.98    2.56  0.98
WTA+MST              1.98  0.99    1.61  0.99    1.24  0.99    0.94  0.99
WTA+SPST             6.25  0.97    4.72  0.97    3.37  0.98    2.60  0.99
WTA+DFST             6.43  0.96    4.60  0.97    2.92  0.98    2.04  0.99

NWWTA+RST            5.31  0.97    4.25  0.98    3.19  0.98    2.70  0.99
NWWTA+NWRST          5.95  0.97    4.65  0.97    3.45  0.98    2.92  0.98
NWWTA+MST            1.99  0.99    1.59  0.99    1.29  0.99    1.06  0.99
NWWTA+SPST           6.30  0.96    4.83  0.97    3.50  0.98    2.84  0.98
NWWTA+DFST           6.49  0.96    4.59  0.97    3.09  0.98    2.35  0.99

GPA+RST             12.64  0.93    8.53  0.95    6.65  0.96    5.65  0.97
GPA+NWRST           12.53  0.93    9.05  0.95    6.90  0.96    5.19  0.97
GPA+MST              2.58  0.99    3.18  0.98    2.28  0.99    1.48  0.99
GPA+SPST             7.64  0.96    6.26  0.96    4.13  0.98    3.55  0.98
GPA+DFST            42.77  0.70   39.39  0.73   32.38  0.79   20.53  0.87

7*WTA+RST            2.09  0.99    1.56  0.99    1.14  0.99    0.90  0.99
7*WTA+NWRST          2.35  0.99    1.75  0.99    1.26  0.99    1.02  0.99

11*WTA+RST           1.84  0.99    1.35  0.99    1.01  0.99    0.82  1.00
11*WTA+NWRST         2.05  0.99    1.53  0.99    1.14  0.99    0.91  0.99

17*WTA+RST           1.65  0.99    1.23  0.99    0.95  0.99    0.77  1.00
17*WTA+NWRST         1.87  0.99    1.39  0.99    1.06  0.99    0.85  1.00

WMV                 24.84  0.85   12.28  0.93    2.13  0.99    0.75  1.00

LABPROP              2.14  0.99    1.16  0.99    0.85  0.99    0.73  1.00

Table 3: USPS-10 - Average error rate and F-measure on 10 classes. 

Train/test split      5%            10%           25%           50%
Predictors          Error  F      Error  F      Error  F      Error  F

WTA+RST              9.62  0.95    8.29  0.95    6.55  0.96    5.36  0.97
WTA+NWRST           10.32  0.94    9.00  0.95    7.17  0.96    5.83  0.97
WTA+MST              1.90  0.99    1.49  0.99    1.22  0.99    0.94  0.99
WTA+SPST             8.68  0.95    7.27  0.96    5.78  0.97    4.88  0.97
WTA+DFST            10.36  0.94    8.13  0.96    5.62  0.97    4.21  0.98

NWWTA+RST            9.71  0.95    8.38  0.95    6.78  0.96    5.89  0.97
NWWTA+NWRST         10.39  0.94    9.08  0.95    7.46  0.96    6.45  0.96
NWWTA+MST            1.91  0.99    1.60  0.99    1.23  0.99    1.09  0.99
NWWTA+SPST           8.76  0.95    7.46  0.96    5.94  0.97    5.28  0.97
NWWTA+DFST          10.46  0.94    8.30  0.95    6.00  0.97    4.65  0.97

GPA+RST             14.81  0.91   13.38  0.92   11.94  0.93    9.81  0.94
GPA+NWRST           17.34  0.90   13.68  0.92   11.39  0.94   11.46  0.94
GPA+MST              3.57  0.98    2.26  0.99    1.77  0.99    1.39  0.99
GPA+SPST             8.42  0.95    7.94  0.95    7.20  0.96    5.71  0.97
GPA+DFST            46.09  0.67   42.59  0.71   37.66  0.75   28.45  0.82

7*WTA+RST            5.28  0.97    4.24  0.98    3.05  0.98    2.37  0.99
7*WTA+NWRST          5.82  0.97    4.73  0.97    3.48  0.98    2.69  0.98

11*WTA+RST           5.07  0.97    3.96  0.98    2.76  0.99    2.11  0.99
11*WTA+NWRST         5.55  0.97    4.38  0.98    3.14  0.98    2.40  0.99

17*WTA+RST           5.17  0.97    3.96  0.98    2.72  0.99    2.05  0.99
17*WTA+NWRST         7.60  0.96    6.38  0.97    4.68  0.97    3.32  0.98

WMV                  2.17  0.99    1.70  0.99    1.53  0.99    1.45  0.99

LABPROP              6.94  0.96    5.19  0.97    2.51  0.99    1.79  0.99

Table 4: USPS-100 - Average error rate and F-measure on 10 classes. 

Train/test split      5%            10%           25%           50%
Predictors          Error  F      Error  F      Error  F      Error  F

WTA+RST             21.73  0.86   21.37  0.86   19.89  0.87   19.09  0.88
WTA+NWRST           21.86  0.86   21.50  0.86   20.03  0.87   19.33  0.88
WTA+MST             21.55  0.86   20.86  0.87   19.35  0.88   18.36  0.88
WTA+SPST            21.86  0.86   21.58  0.86   20.38  0.87   19.40  0.88
WTA+DFST            21.78  0.86   21.22  0.86   19.88  0.87   18.60  0.88

NWWTA+RST           21.83  0.86   21.43  0.86   20.08  0.87   19.64  0.88
NWWTA+NWRST         21.98  0.86   21.55  0.86   20.26  0.87   19.75  0.87
NWWTA+MST           21.55  0.86   20.91  0.87   19.55  0.88   18.89  0.88
NWWTA+SPST          21.86  0.86   21.57  0.86   20.50  0.87   19.81  0.87
NWWTA+DFST          21.79  0.86   21.33  0.86   20.00  0.87   19.09  0.88

GPA+RST             22.70  0.85   22.75  0.85   22.14  0.86   21.28  0.86
GPA+NWRST           23.83  0.84   23.28  0.85   22.48  0.85   21.53  0.86
GPA+MST             21.99  0.86   21.34  0.86   20.77  0.86   20.48  0.87
GPA+SPST            22.33  0.84   21.34  0.86   20.71  0.86   20.74  0.86
GPA+DFST            39.77  0.72   31.93  0.78   25.70  0.83   24.09  0.84

7*WTA+RST           16.83  0.90   16.63  0.90   15.78  0.90   15.29  0.90
7*WTA+NWRST         16.85  0.90   16.60  0.90   15.89  0.90   15.41  0.90

11*WTA+RST          16.28  0.90   16.11  0.90   15.36  0.91   14.92  0.91
11*WTA+NWRST        16.28  0.90   16.08  0.90   15.55  0.90   14.99  0.91

17*WTA+RST          15.93  0.90   15.78  0.90   15.17  0.91   14.63  0.91
17*WTA+NWRST        15.98  0.90   15.69  0.91   15.23  0.91   14.68  0.91

WMV                 42.98  0.70   38.88  0.73   29.85  0.80   22.66  0.85

LABPROP             15.26  0.91   15.21  0.91   14.94  0.91   15.13  0.91

Table 5: KROGAN - Average error rate and F-measure on 17 classes. 

D.B. Wilson. Generating random spanning trees more quickly than the cover time. In Proc. 
of the 28th ACM Symposium on the Theory of Computing, pages 296-303. ACM Press, 1996. 

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and 
harmonic functions. In ICML Workshop on the Continuum from Labeled to Unlabeled Data 
in Machine Learning and Data Mining, 2003. 

Train/test split      5%            10%           25%           50%
Predictors          Error  F      Error  F      Error  F      Error  F

WTA+RST             21.68  0.86   21.05  0.87   20.08  0.87   18.99  0.88
WTA+NWRST           21.47  0.87   21.29  0.86   20.18  0.87   19.17  0.88
WTA+MST             21.57  0.86   20.63  0.87   19.61  0.88   18.37  0.88
WTA+SPST            21.39  0.87   21.34  0.86   20.52  0.87   19.57  0.88
WTA+DFST            21.88  0.86   21.09  0.87   19.82  0.87   18.83  0.88

NWWTA+RST           21.50  0.87   21.15  0.87   20.43  0.87   19.95  0.87
NWWTA+NWRST         21.61  0.86   21.26  0.87   20.52  0.87   20.09  0.87
NWWTA+MST           21.53  0.86   20.95  0.87   20.35  0.87   19.81  0.88
NWWTA+SPST          21.37  0.87   21.06  0.87   20.55  0.87   20.06  0.87
NWWTA+DFST          21.88  0.86   21.05  0.87   20.50  0.87   19.74  0.88

GPA+RST             23.56  0.85   21.11  0.86   21.86  0.86   21.68  0.86
GPA+NWRST           23.91  0.85   23.11  0.85   22.47  0.86   21.30  0.86
GPA+MST             23.32  0.85   21.60  0.86   21.77  0.86   21.67  0.86
GPA+SPST            22.55  0.85   21.89  0.85   21.64  0.85   21.70  0.85
GPA+DFST            41.69  0.71   30.82  0.79   26.75  0.82   23.56  0.84

7*WTA+RST           16.39  0.90   16.09  0.90   15.77  0.91   15.29  0.91
7*WTA+NWRST         16.35  0.90   16.10  0.90   15.77  0.90   15.47  0.91

11*WTA+RST          15.89  0.91   15.61  0.91   15.32  0.91   14.84  0.91
11*WTA+NWRST        15.82  0.91   15.57  0.91   15.34  0.91   14.98  0.91

17*WTA+RST          15.54  0.91   15.31  0.91   14.97  0.91   14.55  0.91
17*WTA+NWRST        15.45  0.91   15.29  0.91   15.05  0.91   14.66  0.91

WMV                 44.74  0.68   40.75  0.72   32.97  0.78   25.28  0.84

LABPROP             14.93  0.91   14.98  0.91   15.23  0.91   15.31  0.90

Table 6: COMBINED - Average error rate and F-measure on 17 classes. 



Predictors          Error  F

WTA+NWRST           10.03  0.95
3*WTA+NWRST          6.44  0.97
7*WTA+NWRST          5.91  0.97
WMV                 44.1   0.71
LABPROP             12.84  0.93

Table 7: WEBSPAM - Test set error rate and F-measure. WTA operates only on NWRST. 